Targeted Risk Understanding & Scoring Technology (TRUST)
A Data Science Project on Home Credit Risk Analysis
# @title TRUST
from IPython.display import IFrame
# Embed YouTube video
IFrame(src="https://www.youtube.com/embed/fBDyLng_aJw?si=WG24rSzmf7bYAmnS",
       width=1680,
       height=945)
Team TRUIST¶
This project is a collaborative effort by the team TRUIST for the University of Maryland as part of the Principles of Data Science course. All code and prose have been developed by the team members: Arunbh Yashaswi, Swattik Maiti, Eniyan Ezhilan, Ajaykumar Balakannan, and Ritik Pratap Singh. The knowledge and skills used to complete this study were acquired through the course, academic materials, and credible internet sources.
The study conducted in this project concerns credit risk analysis in the financial domain. The purpose is to provide an impartial and data-driven exploration into the factors affecting loan defaults. This project does not promote any specific viewpoint or financial institution but aims to contribute to a better understanding of credit risk assessment.
Final Report:¶
Trust Home Credit Loan Default Analysis Final Report
Project Source Code:¶
Introduction¶
In today’s rapidly evolving financial landscape, accurately predicting credit risk is a significant challenge faced by financial institutions worldwide. Credit risk refers to the possibility of a borrower failing to meet their repayment obligations, leading to financial losses for lenders. To mitigate this risk, credit scoring systems have traditionally relied on historical credit data, such as repayment history and credit utilization. While these methods have proven effective for many borrowers, they fall short when evaluating first-time borrowers or individuals with limited credit histories. This limitation creates significant gaps in financial inclusion, leaving many deserving individuals without access to credit.
The purpose of this project is to analyze loan defaults using a comprehensive dataset provided by Home Credit Group. This dataset captures various financial, demographic, and behavioral attributes of borrowers, offering an opportunity to go beyond traditional credit scoring metrics. By leveraging advanced data science techniques, we aim to uncover meaningful insights and develop predictive models that address the limitations of existing credit risk frameworks.
Note: Comprehensive Documentation and Final Dataset¶
Throughout this project, we have tackled multiple stages of data preparation, feature engineering, exploratory analysis, hypothesis testing, and machine learning model development. Each stage required meticulous work and detailed explanations. To keep everything organized and easy to follow, we have provided links to the relevant notebooks in each section of this report. These notebooks are hosted on our GitHub repository and serve as standalone guides for each process we implemented.
👉 Explore the detailed notebooks on our GitHub Repository.
Why Everything is Not in One Notebook¶
Given the complexity and breadth of this project, it was not practical to consolidate all the work into a single notebook. Each stage of the process—data cleaning, exploratory data analysis, feature engineering, feature selection, and model development—required detailed code, extensive computation, and specific explanations tailored to that step.
Final Training Dataset: ultimate_op_dataset¶
After completing all the steps of data cleaning, feature engineering, and feature selection, we created the ultimate_op_dataset, which represents the final, consolidated dataset used for training our machine learning models.
Initial Hypothesis¶
We aim to identify key factors that influence loan defaults and assess how effectively machine learning models can predict borrower risk. This study will cover the entire data science lifecycle, from data collection and preprocessing to feature engineering, model development, and final evaluation. Our goal is to create a robust framework that financial institutions can utilize to make informed credit decisions while addressing gaps in existing scoring models. Specifically, we will investigate the following:
Key Predictors: Analyze the relative importance of features such as amount financed, delinquency, loan tenure, past loan performance, employment stability, and income in predicting loan defaults.
- Example: Applicants who have a high loan amount relative to their income or a history of delinquency are more likely to default. Similarly, borrowers with shorter loan tenures and consistent past loan repayment behavior may exhibit lower default risks, while those with unstable employment histories or irregular income patterns are at higher risk of non-repayment.
Model Performance: Compare the effectiveness of machine learning models like logistic regression, random forests, and gradient boosting in predicting loan default probabilities.
- Example: Models trained on non-traditional features (e.g., behavioral data) may outperform those relying solely on traditional financial metrics.
This project seeks to improve the accuracy of credit scoring, reduce lending risks, and foster financial inclusion by leveraging modern data science techniques.
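To make the "loan amount relative to income" hypothesis concrete, here is a minimal, hedged sketch using the column names from the Home Credit application table (`AMT_CREDIT`, `AMT_INCOME_TOTAL`, `TARGET`). The toy values below are invented for illustration, not drawn from the dataset:

```python
import pandas as pd

# Toy data standing in for the Home Credit application table
# (column names are real; the values are illustrative only)
df = pd.DataFrame({
    "AMT_CREDIT":       [900_000, 400_000, 135_000, 310_000],
    "AMT_INCOME_TOTAL": [150_000, 200_000,  67_500, 155_000],
    "TARGET":           [1, 0, 0, 0],  # 1 = default, 0 = repaid
})

# Credit-to-income ratio: the hypothesis is that higher values
# correspond to higher default risk
df["CREDIT_INCOME_RATIO"] = df["AMT_CREDIT"] / df["AMT_INCOME_TOTAL"]

# Compare the mean ratio between defaulters and non-defaulters
print(df.groupby("TARGET")["CREDIT_INCOME_RATIO"].mean())
```

On the real dataset this kind of ratio feature would be computed once and then examined against the default rate, e.g. by binning it into quantiles.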
Why Is This Important?¶
Credit risk assessment plays a crucial role in maintaining financial stability while promoting responsible lending practices. The landscape of credit evaluation is shifting, with traditional methods often failing to account for the complexities of modern borrower profiles. This study aims to offer a data-driven approach to understanding the factors that lead to loan defaults, helping financial institutions refine their risk models and extend credit responsibly. By improving predictive accuracy, this analysis can contribute to broader financial inclusion, reduce default rates, and foster greater trust between lenders and borrowers.
Importing Relevant Libraries¶
We begin by importing the necessary Python libraries for data manipulation, analysis, visualization, and machine learning. These libraries will enable efficient data exploration and help us build robust predictive models.
!pip install dask[dataframe]
!pip install tensorflow-addons
Requirement already satisfied: dask[dataframe] in /usr/local/lib/python3.10/dist-packages (2024.10.0) Requirement already satisfied: click>=8.1 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (8.1.7) Requirement already satisfied: cloudpickle>=3.0.0 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (3.1.0) Requirement already satisfied: fsspec>=2021.09.0 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (2024.10.0) Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (24.2) Requirement already satisfied: partd>=1.4.0 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (1.4.2) Requirement already satisfied: pyyaml>=5.3.1 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (6.0.2) Requirement already satisfied: toolz>=0.10.0 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (0.12.1) Requirement already satisfied: importlib-metadata>=4.13.0 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (8.5.0) Requirement already satisfied: pandas>=2.0 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (2.2.2) Requirement already satisfied: dask-expr<1.2,>=1.1 in /usr/local/lib/python3.10/dist-packages (from dask[dataframe]) (1.1.16) Requirement already satisfied: pyarrow>=14.0.1 in /usr/local/lib/python3.10/dist-packages (from dask-expr<1.2,>=1.1->dask[dataframe]) (17.0.0) Requirement already satisfied: zipp>=3.20 in /usr/local/lib/python3.10/dist-packages (from importlib-metadata>=4.13.0->dask[dataframe]) (3.21.0) Requirement already satisfied: numpy>=1.22.4 in /usr/local/lib/python3.10/dist-packages (from pandas>=2.0->dask[dataframe]) (1.26.4) Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=2.0->dask[dataframe]) (2.8.2) Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from 
pandas>=2.0->dask[dataframe]) (2024.2) Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas>=2.0->dask[dataframe]) (2024.2) Requirement already satisfied: locket in /usr/local/lib/python3.10/dist-packages (from partd>=1.4.0->dask[dataframe]) (1.0.0) Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=2.0->dask[dataframe]) (1.17.0) Collecting tensorflow-addons Downloading tensorflow_addons-0.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.8 kB) Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from tensorflow-addons) (24.2) Collecting typeguard<3.0.0,>=2.7 (from tensorflow-addons) Downloading typeguard-2.13.3-py3-none-any.whl.metadata (3.6 kB) Downloading tensorflow_addons-0.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (611 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 611.8/611.8 kB 9.0 MB/s eta 0:00:00 Downloading typeguard-2.13.3-py3-none-any.whl (17 kB) Installing collected packages: typeguard, tensorflow-addons Attempting uninstall: typeguard Found existing installation: typeguard 4.4.1 Uninstalling typeguard-4.4.1: Successfully uninstalled typeguard-4.4.1 ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. inflect 7.4.0 requires typeguard>=4.0.1, but you have typeguard 2.13.3 which is incompatible. Successfully installed tensorflow-addons-0.23.0 typeguard-2.13.3
# Standard Python libraries for data manipulation and analysis
import numpy as np
import pandas as pd
# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
# Libraries for data preprocessing
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
# Libraries for machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, confusion_matrix
# Advanced machine learning libraries
import lightgbm as lgbm
import xgboost as xgb
from xgboost import XGBClassifier
# Libraries for statistical analysis
from scipy.stats import *
# Additional utilities
from collections import Counter
import warnings
warnings.filterwarnings("ignore")
from google.colab import drive
import os
# Libraries for big data processing (if needed)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression as SparkLogisticRegression, RandomForestClassifier as SparkRandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
%matplotlib inline
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Data Loading¶
Obtaining Data from Kaggle¶
The dataset for this project was sourced from the Home Credit Default Risk competition on Kaggle. This dataset provides detailed information about loan applicants, their financial and demographic profiles, and their repayment histories. It serves as the foundation for building predictive models to assess credit risk.
After downloading the dataset from Kaggle, the files were uploaded to Google Drive for seamless integration with the Google Colab environment. This setup allows for efficient data access and processing directly in the notebook.
We then loaded each file into separate pandas DataFrames for analysis. This approach ensures modularity and allows for targeted preprocessing of individual tables. Below is the code snippet illustrating how the data was imported:
def data_loader(filepath: str) -> dict:
    try:
        # Dictionary to store loaded DataFrames
        dataframes = {}
        # List of file names to load
        files_to_load = [
            "application_train.csv",
            "bureau.csv",
            "bureau_balance.csv",
            "credit_card_balance.csv",
            "installments_payments.csv",
            "POS_CASH_balance.csv",
            "previous_application.csv",
        ]
        # Load each file and add to the dictionary
        for file in files_to_load:
            df_name = file.split(".")[0]
            dataframes[df_name] = pd.read_csv(os.path.join(filepath, file))
            print(f"Pandas DataFrame '{df_name}' loaded successfully.")
        return dataframes
    except FileNotFoundError as fnf_error:
        print(f"File not found: {fnf_error}")
    except pd.errors.EmptyDataError as ede:
        print(f"Empty data error: {ede}")
    except Exception as e:
        print(f"An error occurred while loading data: {e}")
    return None
# Mount Google Drive
drive.mount('/content/drive')
data_path = '/content/drive/MyDrive/home-credit-default-risk/'
# Use the data_loader function to load all datasets
dataframes = data_loader(data_path)
# Access individual DataFrames
if dataframes:
    appl_train = dataframes["application_train"]
    bureau_df = dataframes["bureau"]
    bureau_balance_df = dataframes["bureau_balance"]
    credit_card_df = dataframes["credit_card_balance"]
    installments_df = dataframes["installments_payments"]
    pos_cash_df = dataframes["POS_CASH_balance"]
    previous_application_df = dataframes["previous_application"]
    # Display the first few rows of one of the tables
    print(appl_train.head())
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Pandas DataFrame 'application_train' loaded successfully.
Pandas DataFrame 'bureau' loaded successfully.
Pandas DataFrame 'bureau_balance' loaded successfully.
Pandas DataFrame 'credit_card_balance' loaded successfully.
Pandas DataFrame 'installments_payments' loaded successfully.
Pandas DataFrame 'POS_CASH_balance' loaded successfully.
Pandas DataFrame 'previous_application' loaded successfully.
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR \
0 100002 1 Cash loans M N
1 100003 0 Cash loans F N
2 100004 0 Revolving loans M Y
3 100006 0 Cash loans F N
4 100007 0 Cash loans M N
FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY \
0 Y 0 202500.0 406597.5 24700.5
1 N 0 270000.0 1293502.5 35698.5
2 Y 0 67500.0 135000.0 6750.0
3 Y 0 135000.0 312682.5 29686.5
4 Y 0 121500.0 513000.0 21865.5
... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 \
0 ... 0 0 0 0
1 ... 0 0 0 0
2 ... 0 0 0 0
3 ... 0 0 0 0
4 ... 0 0 0 0
AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 NaN NaN
4 0.0 0.0
AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 NaN NaN
4 0.0 0.0
AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 0.0 1.0
1 0.0 0.0
2 0.0 0.0
3 NaN NaN
4 0.0 0.0
[5 rows x 122 columns]
Basic Data Exploration¶
## Some utility functions

# Table shape
def pandas_df_shape(df: pd.DataFrame) -> None:
    """
    Accepts a pandas DataFrame and prints the number of rows and columns of the dataset.
    """
    num_rows = df.shape[0]  # Get number of rows
    num_cols = df.shape[1]  # Get number of columns
    # Print the shape (rows, columns)
    print(f"Shape of DataFrame : ({num_rows}, {num_cols})")
# Function to check skewness in each column
def check_skewness(df: pd.DataFrame, col_name: str) -> None:
    print(df[col_name].value_counts())
    percentage_counts = df[col_name].value_counts(normalize=True) * 100
    print(percentage_counts)
# Function to check the fill rate for each column of the whole dataset
def calculate_fill_rate(df: pd.DataFrame) -> pd.DataFrame:
    fill_rates = {
        'Column Name': [],
        'Fill Rate (%)': []
    }
    total_rows = df.shape[0]
    for column in df.columns:
        non_null_count = df[column].notnull().sum()
        fill_rate = (non_null_count / total_rows) * 100
        fill_rates['Column Name'].append(column)
        fill_rates['Fill Rate (%)'].append(fill_rate)
    return pd.DataFrame(fill_rates)
# Function to plot the distribution of a variable
def plot_distribution(df: pd.DataFrame, target_variable: str) -> None:
    plt.figure(figsize=(15, 6))

    # Histogram
    plt.subplot(1, 3, 1)
    sns.histplot(df[target_variable], bins=30, kde=False, color='blue', edgecolor='black')
    plt.title(f'Histogram of {target_variable}')
    plt.xlabel(target_variable)
    plt.ylabel('Frequency')

    # KDE Plot
    plt.subplot(1, 3, 2)
    sns.kdeplot(df[target_variable], color='red', fill=True)
    plt.title(f'KDE Plot of {target_variable}')
    plt.xlabel(target_variable)
    plt.ylabel('Density')

    # Box Plot
    plt.subplot(1, 3, 3)
    sns.boxplot(x=df[target_variable], color='lightgray')
    plt.title(f'Box Plot of {target_variable}')
    plt.xlabel(target_variable)

    plt.tight_layout()
    plt.show()
The above utility functions simplify data exploration by analyzing DataFrame shape, skewness, fill rates, and variable distributions. These tools provide quick insights into data quality and structure, ensuring efficient preparation for analysis.
# List of tables to analyze
dataframes_list = [
    appl_train,
    bureau_df,
    bureau_balance_df,
    credit_card_df,
    installments_df,
    pos_cash_df,
    previous_application_df
]

# Example: Use utility functions on all DataFrames
for idx, df in enumerate(dataframes_list, start=1):
    print(f"\n--- Table {idx}: Analysis ---\n")

    # Check shape of the DataFrame
    pandas_df_shape(df)

    # Calculate fill rates for the DataFrame
    print("\nFill Rates:")
    fill_rate_df = calculate_fill_rate(df)
    print(fill_rate_df)

    # Example column analysis (e.g., target column or specific feature)
    if 'TARGET' in df.columns:
        print("\nDistribution of TARGET variable:")
        plot_distribution(df, 'TARGET')

    # Skewness check (example: a categorical column)
    if 'NAME_CONTRACT_TYPE' in df.columns:
        print("\nSkewness in NAME_CONTRACT_TYPE:")
        check_skewness(df, 'NAME_CONTRACT_TYPE')
--- Table 1: Analysis ---
Shape of DataFrame : (307511, 122)
Fill Rates:
Column Name Fill Rate (%)
0 SK_ID_CURR 100.000000
1 TARGET 100.000000
2 NAME_CONTRACT_TYPE 100.000000
3 CODE_GENDER 100.000000
4 FLAG_OWN_CAR 100.000000
.. ... ...
117 AMT_REQ_CREDIT_BUREAU_DAY 86.498369
118 AMT_REQ_CREDIT_BUREAU_WEEK 86.498369
119 AMT_REQ_CREDIT_BUREAU_MON 86.498369
120 AMT_REQ_CREDIT_BUREAU_QRT 86.498369
121 AMT_REQ_CREDIT_BUREAU_YEAR 86.498369
[122 rows x 2 columns]
Distribution of TARGET variable:
Skewness in NAME_CONTRACT_TYPE:
NAME_CONTRACT_TYPE
Cash loans 278232
Revolving loans 29279
Name: count, dtype: int64
NAME_CONTRACT_TYPE
Cash loans 90.478715
Revolving loans 9.521285
Name: proportion, dtype: float64
--- Table 2: Analysis ---
Shape of DataFrame : (1716428, 17)
Fill Rates:
Column Name Fill Rate (%)
0 SK_ID_CURR 100.000000
1 SK_ID_BUREAU 100.000000
2 CREDIT_ACTIVE 100.000000
3 CREDIT_CURRENCY 100.000000
4 DAYS_CREDIT 100.000000
5 CREDIT_DAY_OVERDUE 100.000000
6 DAYS_CREDIT_ENDDATE 93.850427
7 DAYS_ENDDATE_FACT 63.083042
8 AMT_CREDIT_MAX_OVERDUE 34.486736
9 CNT_CREDIT_PROLONG 100.000000
10 AMT_CREDIT_SUM 99.999243
11 AMT_CREDIT_SUM_DEBT 84.988068
12 AMT_CREDIT_SUM_LIMIT 65.522585
13 AMT_CREDIT_SUM_OVERDUE 100.000000
14 CREDIT_TYPE 100.000000
15 DAYS_CREDIT_UPDATE 100.000000
16 AMT_ANNUITY 28.526510
--- Table 3: Analysis ---
Shape of DataFrame : (27299925, 3)
Fill Rates:
Column Name Fill Rate (%)
0 SK_ID_BUREAU 100.0
1 MONTHS_BALANCE 100.0
2 STATUS 100.0
--- Table 4: Analysis ---
Shape of DataFrame : (3840312, 23)
Fill Rates:
Column Name Fill Rate (%)
0 SK_ID_PREV 100.000000
1 SK_ID_CURR 100.000000
2 MONTHS_BALANCE 100.000000
3 AMT_BALANCE 100.000000
4 AMT_CREDIT_LIMIT_ACTUAL 100.000000
5 AMT_DRAWINGS_ATM_CURRENT 80.475128
6 AMT_DRAWINGS_CURRENT 100.000000
7 AMT_DRAWINGS_OTHER_CURRENT 80.475128
8 AMT_DRAWINGS_POS_CURRENT 80.475128
9 AMT_INST_MIN_REGULARITY 92.051792
10 AMT_PAYMENT_CURRENT 80.001937
11 AMT_PAYMENT_TOTAL_CURRENT 100.000000
12 AMT_RECEIVABLE_PRINCIPAL 100.000000
13 AMT_RECIVABLE 100.000000
14 AMT_TOTAL_RECEIVABLE 100.000000
15 CNT_DRAWINGS_ATM_CURRENT 80.475128
16 CNT_DRAWINGS_CURRENT 100.000000
17 CNT_DRAWINGS_OTHER_CURRENT 80.475128
18 CNT_DRAWINGS_POS_CURRENT 80.475128
19 CNT_INSTALMENT_MATURE_CUM 92.051792
20 NAME_CONTRACT_STATUS 100.000000
21 SK_DPD 100.000000
22 SK_DPD_DEF 100.000000
--- Table 5: Analysis ---
Shape of DataFrame : (13605401, 8)
Fill Rates:
Column Name Fill Rate (%)
0 SK_ID_PREV 100.000000
1 SK_ID_CURR 100.000000
2 NUM_INSTALMENT_VERSION 100.000000
3 NUM_INSTALMENT_NUMBER 100.000000
4 DAYS_INSTALMENT 100.000000
5 DAYS_ENTRY_PAYMENT 99.978648
6 AMT_INSTALMENT 100.000000
7 AMT_PAYMENT 99.978648
--- Table 6: Analysis ---
Shape of DataFrame : (10001358, 8)
Fill Rates:
Column Name Fill Rate (%)
0 SK_ID_PREV 100.000000
1 SK_ID_CURR 100.000000
2 MONTHS_BALANCE 100.000000
3 CNT_INSTALMENT 99.739325
4 CNT_INSTALMENT_FUTURE 99.739165
5 NAME_CONTRACT_STATUS 100.000000
6 SK_DPD 100.000000
7 SK_DPD_DEF 100.000000
--- Table 7: Analysis ---
Shape of DataFrame : (1670214, 37)
Fill Rates:
Column Name Fill Rate (%)
0 SK_ID_PREV 100.000000
1 SK_ID_CURR 100.000000
2 NAME_CONTRACT_TYPE 100.000000
3 AMT_ANNUITY 77.713335
4 AMT_APPLICATION 100.000000
5 AMT_CREDIT 99.999940
6 AMT_DOWN_PAYMENT 46.363520
7 AMT_GOODS_PRICE 76.918227
8 WEEKDAY_APPR_PROCESS_START 100.000000
9 HOUR_APPR_PROCESS_START 100.000000
10 FLAG_LAST_APPL_PER_CONTRACT 100.000000
11 NFLAG_LAST_APPL_IN_DAY 100.000000
12 RATE_DOWN_PAYMENT 46.363520
13 RATE_INTEREST_PRIMARY 0.356302
14 RATE_INTEREST_PRIVILEGED 0.356302
15 NAME_CASH_LOAN_PURPOSE 100.000000
16 NAME_CONTRACT_STATUS 100.000000
17 DAYS_DECISION 100.000000
18 NAME_PAYMENT_TYPE 100.000000
19 CODE_REJECT_REASON 100.000000
20 NAME_TYPE_SUITE 50.880246
21 NAME_CLIENT_TYPE 100.000000
22 NAME_GOODS_CATEGORY 100.000000
23 NAME_PORTFOLIO 100.000000
24 NAME_PRODUCT_TYPE 100.000000
25 CHANNEL_TYPE 100.000000
26 SELLERPLACE_AREA 100.000000
27 NAME_SELLER_INDUSTRY 100.000000
28 CNT_PAYMENT 77.713634
29 NAME_YIELD_GROUP 100.000000
30 PRODUCT_COMBINATION 99.979284
31 DAYS_FIRST_DRAWING 59.701871
32 DAYS_FIRST_DUE 59.701871
33 DAYS_LAST_DUE_1ST_VERSION 59.701871
34 DAYS_LAST_DUE 59.701871
35 DAYS_TERMINATION 59.701871
36 NFLAG_INSURED_ON_APPROVAL 59.701871
Skewness in NAME_CONTRACT_TYPE:
NAME_CONTRACT_TYPE
Cash loans 747553
Consumer loans 729151
Revolving loans 193164
XNA 346
Name: count, dtype: int64
NAME_CONTRACT_TYPE
Cash loans 44.757917
Consumer loans 43.656142
Revolving loans 11.565225
XNA 0.020716
Name: proportion, dtype: float64
This analysis provides a detailed overview of multiple datasets, highlighting their structure, fill rates, and distribution trends. Each table includes the shape of the DataFrame, with the number of rows and columns clearly defined. Fill rates for individual columns are calculated, identifying areas with missing data. Key categorical variables, such as NAME_CONTRACT_TYPE, are analyzed for skewness, providing insights into data imbalances. The breakdown allows for a focused understanding of data quality and distribution, essential for subsequent preprocessing and modeling steps.
Data Curation and Cleaning¶
fill_rate_df_app_train = calculate_fill_rate(appl_train)
fill_rate_df_app_train
|   | Column Name | Fill Rate (%) |
|---|---|---|
| 0 | SK_ID_CURR | 100.000000 |
| 1 | TARGET | 100.000000 |
| 2 | NAME_CONTRACT_TYPE | 100.000000 |
| 3 | CODE_GENDER | 100.000000 |
| 4 | FLAG_OWN_CAR | 100.000000 |
| ... | ... | ... |
| 117 | AMT_REQ_CREDIT_BUREAU_DAY | 86.498369 |
| 118 | AMT_REQ_CREDIT_BUREAU_WEEK | 86.498369 |
| 119 | AMT_REQ_CREDIT_BUREAU_MON | 86.498369 |
| 120 | AMT_REQ_CREDIT_BUREAU_QRT | 86.498369 |
| 121 | AMT_REQ_CREDIT_BUREAU_YEAR | 86.498369 |
122 rows × 2 columns
appl_train.drop_duplicates(inplace=True)
4.1 Data Imputation¶
The plan is to keep only the columns with a fill rate greater than 50% but less than 100%, and then find a way to impute the missing values. For now, I will not deal with columns whose fill rate is 50% or less, and will simply drop them.
In this code, I am only concerned with imputing the important variables AMT_ANNUITY, AMT_GOODS_PRICE, NAME_TYPE_SUITE, OCCUPATION_TYPE, and CNT_FAM_MEMBERS (handling them is part of my task list in the project). I am planning to build machine learning models to find the optimal values for the missing cases. Since we have about 300,000+ rows in the appl_train dataset, I am confident that building imputation models is feasible.
Two Possible Approaches for the Above¶
Approach 1 - Even after dropping columns with a fill rate of 50% or less, there will still be many columns with a fill rate between 50% and 100%. One possible way of dealing with these columns (for the imputation model) is simply dropping them as well; I can bring these columns back while building the main models. This gives us the freedom to use any machine learning model for imputation, even the likes of linear and logistic regression, which are very sensitive to missing data.
Approach 2 - Keep all the columns intact and don't drop any (other than the columns with a fill rate of 50% or less). Proceed with the normal machine learning modeling flow, but restrict myself to models that can handle missing data on their own, like XGBoost.
# dropping columns with <= 50% fillrate
# make a list of all columns with <= 50% fillrate
low_fill_rate_columns = fill_rate_df_app_train[fill_rate_df_app_train['Fill Rate (%)'] <= 50.00]['Column Name'].tolist()
low_fill_rate_columns
['OWN_CAR_AGE', 'EXT_SOURCE_1', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE']
# dropping these columns from the appl_train DataFrame
appl_train_refined = appl_train.drop(columns = low_fill_rate_columns)
appl_train_refined.head()
|   | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 81 columns
4.2 EDA for the imputation Model¶
My first imputation model will be focused on the target variable AMT_ANNUITY.
AMT_ANNUITY represents the loan annuity for the particular loan application.
pandas_df_shape(appl_train_refined)
Shape of DataFrame : (307511, 81)
# how many null values are there for the target variable
null_count = appl_train_refined['AMT_ANNUITY'].isnull().sum()
print(f"Number of null values in 'AMT_ANNUITY': {null_count}")
Number of null values in 'AMT_ANNUITY': 12
So there are only 12 null values in a dataset of 300,000+ rows. Imputation for this may seem like overkill, but a separate analysis showed that this variable is of utmost importance for predicting defaults, hence we proceed with it.
Let us try to get an idea of the distribution of the target variable
appl_train_refined['AMT_ANNUITY'].describe()
|   | AMT_ANNUITY |
|---|---|
| count | 307499.000000 |
| mean | 27108.573909 |
| std | 14493.737315 |
| min | 1615.500000 |
| 25% | 16524.000000 |
| 50% | 24903.000000 |
| 75% | 34596.000000 |
| max | 258025.500000 |
plot_distribution(appl_train_refined,'AMT_ANNUITY')
4.3 Inference about the Distribution of the Target Variable - AMT_ANNUITY¶
The variable AMT_ANNUITY is not normally distributed.
Histogram: The data shows a right-skewed (positively skewed) distribution. Most of the values are concentrated toward the lower end (closer to 0), with fewer occurrences of higher values, which is typical of skewness.
KDE Plot: The Kernel Density Estimate (KDE) confirms the right-skew, with a sharp peak around the lower values and a long tail extending toward higher values. This long tail suggests the presence of some higher outliers.
Box Plot: The box plot further highlights the skewness, as the median is closer to the lower quartile, and there are many outliers on the right-hand side (values much higher than the rest).
Overall, the distribution of AMT_ANNUITY is right-skewed, with most values concentrated in the lower range, and a few higher values (outliers) extending toward the upper range.
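When a variable is right-skewed like this, a `log1p` transform is a common way to pull in the long tail before modeling. A minimal sketch, using a synthetic lognormal sample as a stand-in for the real AMT_ANNUITY column:

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed sample, standing in for AMT_ANNUITY
values = np.random.default_rng(42).lognormal(mean=10, sigma=0.5, size=10_000)

# log1p compresses the upper tail, bringing the distribution
# much closer to symmetric
print(f"skew before: {skew(values):.2f}")
print(f"skew after : {skew(np.log1p(values)):.2f}")
```

Whether to transform depends on the downstream model: tree-based methods are largely insensitive to monotone transforms, while linear models and distance-based methods usually benefit.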
appl_train_refined.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 81 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   SK_ID_CURR                    307511 non-null  int64
 1   TARGET                        307511 non-null  int64
 2   NAME_CONTRACT_TYPE            307511 non-null  object
 3   CODE_GENDER                   307511 non-null  object
 4   FLAG_OWN_CAR                  307511 non-null  object
 5   FLAG_OWN_REALTY               307511 non-null  object
 6   CNT_CHILDREN                  307511 non-null  int64
 7   AMT_INCOME_TOTAL              307511 non-null  float64
 8   AMT_CREDIT                    307511 non-null  float64
 9   AMT_ANNUITY                   307499 non-null  float64
 10  AMT_GOODS_PRICE               307233 non-null  float64
 11  NAME_TYPE_SUITE               306219 non-null  object
 12  NAME_INCOME_TYPE              307511 non-null  object
 13  NAME_EDUCATION_TYPE           307511 non-null  object
 14  NAME_FAMILY_STATUS            307511 non-null  object
 15  NAME_HOUSING_TYPE             307511 non-null  object
 16  REGION_POPULATION_RELATIVE    307511 non-null  float64
 17  DAYS_BIRTH                    307511 non-null  int64
 18  DAYS_EMPLOYED                 307511 non-null  int64
 19  DAYS_REGISTRATION             307511 non-null  float64
 20  DAYS_ID_PUBLISH               307511 non-null  int64
 21  FLAG_MOBIL                    307511 non-null  int64
 22  FLAG_EMP_PHONE                307511 non-null  int64
 23  FLAG_WORK_PHONE               307511 non-null  int64
 24  FLAG_CONT_MOBILE              307511 non-null  int64
 25  FLAG_PHONE                    307511 non-null  int64
 26  FLAG_EMAIL                    307511 non-null  int64
 27  OCCUPATION_TYPE               211120 non-null  object
 28  CNT_FAM_MEMBERS               307509 non-null  float64
 29  REGION_RATING_CLIENT          307511 non-null  int64
 30  REGION_RATING_CLIENT_W_CITY   307511 non-null  int64
 31  WEEKDAY_APPR_PROCESS_START    307511 non-null  object
 32  HOUR_APPR_PROCESS_START       307511 non-null  int64
 33  REG_REGION_NOT_LIVE_REGION    307511 non-null  int64
 34  REG_REGION_NOT_WORK_REGION    307511 non-null  int64
 35  LIVE_REGION_NOT_WORK_REGION   307511 non-null  int64
 36  REG_CITY_NOT_LIVE_CITY        307511 non-null  int64
 37  REG_CITY_NOT_WORK_CITY        307511 non-null  int64
 38  LIVE_CITY_NOT_WORK_CITY       307511 non-null  int64
 39  ORGANIZATION_TYPE             307511 non-null  object
 40  EXT_SOURCE_2                  306851 non-null  float64
 41  EXT_SOURCE_3                  246546 non-null  float64
 42  YEARS_BEGINEXPLUATATION_AVG   157504 non-null  float64
 43  FLOORSMAX_AVG                 154491 non-null  float64
 44  YEARS_BEGINEXPLUATATION_MODE  157504 non-null  float64
 45  FLOORSMAX_MODE                154491 non-null  float64
 46  YEARS_BEGINEXPLUATATION_MEDI  157504 non-null  float64
 47  FLOORSMAX_MEDI                154491 non-null  float64
 48  TOTALAREA_MODE                159080 non-null  float64
 49  EMERGENCYSTATE_MODE           161756 non-null  object
 50  OBS_30_CNT_SOCIAL_CIRCLE      306490 non-null  float64
 51  DEF_30_CNT_SOCIAL_CIRCLE      306490 non-null  float64
 52  OBS_60_CNT_SOCIAL_CIRCLE      306490 non-null  float64
 53  DEF_60_CNT_SOCIAL_CIRCLE      306490 non-null  float64
 54  DAYS_LAST_PHONE_CHANGE        307510 non-null  float64
 55  FLAG_DOCUMENT_2               307511 non-null  int64
 56  FLAG_DOCUMENT_3               307511 non-null  int64
 57  FLAG_DOCUMENT_4               307511 non-null  int64
 58  FLAG_DOCUMENT_5               307511 non-null  int64
 59  FLAG_DOCUMENT_6               307511 non-null  int64
 60  FLAG_DOCUMENT_7               307511 non-null  int64
 61  FLAG_DOCUMENT_8               307511 non-null  int64
 62  FLAG_DOCUMENT_9               307511 non-null  int64
 63  FLAG_DOCUMENT_10              307511 non-null  int64
 64  FLAG_DOCUMENT_11              307511 non-null  int64
 65  FLAG_DOCUMENT_12              307511 non-null  int64
 66  FLAG_DOCUMENT_13              307511 non-null  int64
 67  FLAG_DOCUMENT_14              307511 non-null  int64
 68  FLAG_DOCUMENT_15              307511 non-null  int64
 69  FLAG_DOCUMENT_16              307511 non-null  int64
 70  FLAG_DOCUMENT_17              307511 non-null  int64
 71  FLAG_DOCUMENT_18              307511 non-null  int64
 72  FLAG_DOCUMENT_19              307511 non-null  int64
 73  FLAG_DOCUMENT_20              307511 non-null  int64
 74  FLAG_DOCUMENT_21              307511 non-null  int64
 75  AMT_REQ_CREDIT_BUREAU_HOUR    265992 non-null  float64
 76  AMT_REQ_CREDIT_BUREAU_DAY     265992 non-null  float64
 77  AMT_REQ_CREDIT_BUREAU_WEEK    265992 non-null  float64
 78  AMT_REQ_CREDIT_BUREAU_MON     265992 non-null  float64
 79  AMT_REQ_CREDIT_BUREAU_QRT     265992 non-null  float64
 80  AMT_REQ_CREDIT_BUREAU_YEAR    265992 non-null  float64
dtypes: float64(27), int64(41), object(13)
memory usage: 190.0+ MB
# Identify all the categorical variables
categorical_columns = appl_train_refined.select_dtypes(include=['object', 'category']).columns.tolist()
# Identify numerical variables (usually of type 'int64' or 'float64')
numerical_columns = appl_train_refined.select_dtypes(include=['int64', 'float64']).columns.tolist()
print("Categorical Variables:", categorical_columns)
print("Numerical Variables:", numerical_columns)
Categorical Variables: ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'EMERGENCYSTATE_MODE']
Numerical Variables: ['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'YEARS_BEGINEXPLUATATION_AVG', 'FLOORSMAX_AVG', 'YEARS_BEGINEXPLUATATION_MODE', 'FLOORSMAX_MODE', 'YEARS_BEGINEXPLUATATION_MEDI', 'FLOORSMAX_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
# printing all the categorical columns
categorical_appl_train_refined = appl_train_refined[categorical_columns]
categorical_appl_train_refined.head()
| NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | OCCUPATION_TYPE | WEEKDAY_APPR_PROCESS_START | ORGANIZATION_TYPE | EMERGENCYSTATE_MODE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cash loans | M | N | Y | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | Laborers | WEDNESDAY | Business Entity Type 3 | No |
| 1 | Cash loans | F | N | N | Family | State servant | Higher education | Married | House / apartment | Core staff | MONDAY | School | No |
| 2 | Revolving loans | M | Y | Y | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | Laborers | MONDAY | Government | NaN |
| 3 | Cash loans | F | N | Y | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | Laborers | WEDNESDAY | Business Entity Type 3 | NaN |
| 4 | Cash loans | M | N | Y | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | Core staff | THURSDAY | Religion | NaN |
4.4 Feature Engineering for the Imputation Model¶
Now let's do some basic feature engineering to identify important features and drop uninformative ones.
Encoding Categorical variables¶
Since my plan is to use a tree-based algorithm for the imputation model, label encoding is a more reasonable choice than one-hot encoding: tree-based algorithms are not sensitive to the absolute magnitude of the encoded values.
# Initialize the LabelEncoder
label_encoder = LabelEncoder()
# Work on an explicit copy to avoid SettingWithCopyWarning
categorical_appl_train_refined = categorical_appl_train_refined.copy()
# Loop through each categorical column and apply label encoding,
# fitting only on the non-null values so missing entries remain NaN
for column in categorical_columns:
    not_null = categorical_appl_train_refined[column].notnull()
    categorical_appl_train_refined.loc[not_null, column + '_encoded'] = label_encoder.fit_transform(
        categorical_appl_train_refined.loc[not_null, column]
    )
categorical_appl_train_refined.head(2)
| NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | OCCUPATION_TYPE | ... | FLAG_OWN_REALTY_encoded | NAME_TYPE_SUITE_encoded | NAME_INCOME_TYPE_encoded | NAME_EDUCATION_TYPE_encoded | NAME_FAMILY_STATUS_encoded | NAME_HOUSING_TYPE_encoded | OCCUPATION_TYPE_encoded | WEEKDAY_APPR_PROCESS_START_encoded | ORGANIZATION_TYPE_encoded | EMERGENCYSTATE_MODE_encoded | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cash loans | M | N | Y | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | Laborers | ... | 1.0 | 6.0 | 7.0 | 4.0 | 3.0 | 1.0 | 8.0 | 6.0 | 5.0 | 0.0 |
| 1 | Cash loans | F | N | N | Family | State servant | Higher education | Married | House / apartment | Core staff | ... | 0.0 | 1.0 | 4.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 | 39.0 | 0.0 |
2 rows × 26 columns
# dropping the non-encoded columns (non-inplace, to avoid SettingWithCopyWarning)
categorical_appl_train_refined = categorical_appl_train_refined.drop(columns=categorical_columns)
categorical_appl_train_refined.head(2)
| NAME_CONTRACT_TYPE_encoded | CODE_GENDER_encoded | FLAG_OWN_CAR_encoded | FLAG_OWN_REALTY_encoded | NAME_TYPE_SUITE_encoded | NAME_INCOME_TYPE_encoded | NAME_EDUCATION_TYPE_encoded | NAME_FAMILY_STATUS_encoded | NAME_HOUSING_TYPE_encoded | OCCUPATION_TYPE_encoded | WEEKDAY_APPR_PROCESS_START_encoded | ORGANIZATION_TYPE_encoded | EMERGENCYSTATE_MODE_encoded | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | 6.0 | 7.0 | 4.0 | 3.0 | 1.0 | 8.0 | 6.0 | 5.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 | 39.0 | 0.0 |
# merging the encoded categorical variables in main df
appl_train_2 = pd.concat([appl_train_refined,categorical_appl_train_refined], axis = 1)
appl_train_2.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_OWN_REALTY_encoded | NAME_TYPE_SUITE_encoded | NAME_INCOME_TYPE_encoded | NAME_EDUCATION_TYPE_encoded | NAME_FAMILY_STATUS_encoded | NAME_HOUSING_TYPE_encoded | OCCUPATION_TYPE_encoded | WEEKDAY_APPR_PROCESS_START_encoded | ORGANIZATION_TYPE_encoded | EMERGENCYSTATE_MODE_encoded | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 1.0 | 6.0 | 7.0 | 4.0 | 3.0 | 1.0 | 8.0 | 6.0 | 5.0 | 0.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 1.0 | 4.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 | 39.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 1.0 | 6.0 | 7.0 | 4.0 | 3.0 | 1.0 | 8.0 | 1.0 | 11.0 | NaN |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 1.0 | 6.0 | 7.0 | 4.0 | 0.0 | 1.0 | 8.0 | 6.0 | 5.0 | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 1.0 | 6.0 | 7.0 | 4.0 | 3.0 | 1.0 | 3.0 | 4.0 | 37.0 | NaN |
5 rows × 94 columns
# drop non encoded categorical columns from appl_train_2
appl_train_2.drop(columns = categorical_columns,inplace=True)
appl_train_2.head()
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | FLAG_OWN_REALTY_encoded | NAME_TYPE_SUITE_encoded | NAME_INCOME_TYPE_encoded | NAME_EDUCATION_TYPE_encoded | NAME_FAMILY_STATUS_encoded | NAME_HOUSING_TYPE_encoded | OCCUPATION_TYPE_encoded | WEEKDAY_APPR_PROCESS_START_encoded | ORGANIZATION_TYPE_encoded | EMERGENCYSTATE_MODE_encoded | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | 0.018801 | -9461 | -637 | ... | 1.0 | 6.0 | 7.0 | 4.0 | 3.0 | 1.0 | 8.0 | 6.0 | 5.0 | 0.0 |
| 1 | 100003 | 0 | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | 0.003541 | -16765 | -1188 | ... | 0.0 | 1.0 | 4.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 | 39.0 | 0.0 |
| 2 | 100004 | 0 | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | 0.010032 | -19046 | -225 | ... | 1.0 | 6.0 | 7.0 | 4.0 | 3.0 | 1.0 | 8.0 | 1.0 | 11.0 | NaN |
| 3 | 100006 | 0 | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | 0.008019 | -19005 | -3039 | ... | 1.0 | 6.0 | 7.0 | 4.0 | 0.0 | 1.0 | 8.0 | 6.0 | 5.0 | NaN |
| 4 | 100007 | 0 | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | 0.028663 | -19932 | -3038 | ... | 1.0 | 6.0 | 7.0 | 4.0 | 3.0 | 1.0 | 3.0 | 4.0 | 37.0 | NaN |
5 rows × 81 columns
#changing all the columns to float
appl_train_2 = appl_train_2.astype(float)
4.5 Correlation Matrix¶
Now that all variables are of the same type (float), we will identify the relationships between variables through a correlation matrix. After this step, we can drop features that have a very weak correlation with AMT_ANNUITY and merge variables that have a very strong correlation with each other.
# Calculate the correlation matrix
correlation_matrix = appl_train_2.corr()
# Create the heatmap
plt.figure(figsize=(20, 16)) # You can adjust the size depending on the number of features
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Feature Correlation Heatmap')
plt.show()
It is very difficult to make any inference from the above heatmap, as most values fall in the gray range. To get a better visualization, I want to filter out some of the weakly correlated values. I am also interested in the distribution of values in the correlation matrix itself.
# Flatten the correlation matrix and filter out the 1.0 values (self-correlation)
correlation_values = correlation_matrix.values.flatten()
correlation_values = correlation_values[correlation_values != 1.0] # Exclude self-correlations
plt.figure(figsize=(10, 6))
sns.histplot(correlation_values, bins=20, kde=True)
plt.title('Distribution of Correlation Coefficients')
plt.xlabel('Correlation Coefficient')
plt.ylabel('Frequency')
plt.grid()
plt.show()
From the graph, 0.05 seems to be a reasonable cutoff. Therefore, I am dropping all columns from the dataset whose absolute correlation with AMT_ANNUITY is <= 0.05.
correlation_with_amt_annuity = correlation_matrix['AMT_ANNUITY']
# Identify columns to drop (correlation <= 0.05)
columns_to_drop = correlation_with_amt_annuity[abs(correlation_with_amt_annuity) <= 0.05].index.tolist()
# Drop those columns from the DataFrame
appl_train_3 = appl_train_2.drop(columns=columns_to_drop)
appl_train_3.head()
| AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_EMPLOYED | FLAG_EMP_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | ... | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | NAME_CONTRACT_TYPE_encoded | CODE_GENDER_encoded | FLAG_OWN_CAR_encoded | NAME_INCOME_TYPE_encoded | NAME_EDUCATION_TYPE_encoded | NAME_FAMILY_STATUS_encoded | ORGANIZATION_TYPE_encoded | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | 0.018801 | -637.0 | 1.0 | 0.0 | 1.0 | 2.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 7.0 | 4.0 | 3.0 | 5.0 |
| 1 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | 0.003541 | -1188.0 | 1.0 | 0.0 | 2.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 1.0 | 1.0 | 39.0 |
| 2 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | 0.010032 | -225.0 | 1.0 | 0.0 | 1.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 7.0 | 4.0 | 3.0 | 11.0 |
| 3 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | 0.008019 | -3039.0 | 1.0 | 0.0 | 2.0 | 2.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 4.0 | 0.0 | 5.0 |
| 4 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | 0.028663 | -3038.0 | 1.0 | 0.0 | 1.0 | 2.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 7.0 | 4.0 | 3.0 | 37.0 |
5 rows × 30 columns
appl_train_3.shape
(307511, 30)
We have reduced the dataset drastically, from 81 columns to just 30. I am not sure whether selecting a threshold of 0.05 was a good idea or not, but we will see.
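One quick way to sanity-check the cutoff choice is to see how many features would survive at a few candidate thresholds. This is a minimal sketch using a toy correlation series in place of correlation_matrix['AMT_ANNUITY'] (the values below are synthetic, not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for correlation_matrix['AMT_ANNUITY'] (80 random correlations)
rng = np.random.default_rng(0)
corr = pd.Series(rng.uniform(-1, 1, size=80))

# Count how many features would be retained at each candidate cutoff
kept = {cutoff: int((corr.abs() > cutoff).sum()) for cutoff in (0.02, 0.05, 0.1, 0.2)}
for cutoff, n in kept.items():
    print(f"cutoff={cutoff}: {n} columns retained")
```

Running the same loop on the real correlation series would show how sensitive the final column count is to the chosen threshold.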
# Calculate the correlation matrix
correlation_matrix2 = appl_train_3.corr()
# Create the heatmap
plt.figure(figsize=(16, 16)) # You can adjust the size depending on the number of features
sns.heatmap(correlation_matrix2, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Feature Correlation Heatmap')
plt.show()
Observations and Inferences from the Correlation Matrix¶
The two features that show the strongest correlation with our target "AMT_ANNUITY" are AMT_CREDIT and AMT_GOODS_PRICE, with correlation values of 0.77 and 0.88 respectively.
- AMT_CREDIT - Credit amount of the loan
- AMT_GOODS_PRICE - For consumer loans it is the price of the goods for which the loan is given
However, when we check these two variables against each other, we find that they have a correlation of 0.99.
This implies that if both variables stay in the training data, it could lead to a collinearity problem.
Hence I am making the decision to merge these two features into a single feature. We will explore how to merge them in a sensible way in the following cells.
We can see similar problems of very strong correlation among two other groups of variables:
- Group 1 - (FLOORSMAX_AVG, FLOORSMAX_MODE, FLOORSMAX_MEDI): these are three measures of central tendency of the same attribute, so there is no point merging them; instead we will keep one and drop the other two.
- Group 2 - (REGION_RATING_CLIENT and REGION_RATING_CLIENT_W_CITY): these are two different ways of representing the same information, i.e., the rating (1, 2, 3) of the region where the client lives. We will keep REGION_RATING_CLIENT_W_CITY and drop the other, since it also takes the client's city into account, not just the region, and is therefore more precise.
That is about all; now we can make these changes and proceed with data preparation.
# merging the columns AMT_CREDIT and AMT_GOODS_PRICE
# First explore the value counts of each
print(appl_train_3[['AMT_CREDIT','AMT_GOODS_PRICE']].describe())
#checking for null values
print("Null value count in AMT_CREDIT : ", appl_train_3['AMT_CREDIT'].isnull().sum())
print("Null value count in AMT_GOODS_PRICE : ",appl_train_3['AMT_GOODS_PRICE'].isnull().sum())
         AMT_CREDIT  AMT_GOODS_PRICE
count  3.075110e+05     3.072330e+05
mean   5.990260e+05     5.383962e+05
std    4.024908e+05     3.694465e+05
min    4.500000e+04     4.050000e+04
25%    2.700000e+05     2.385000e+05
50%    5.135310e+05     4.500000e+05
75%    8.086500e+05     6.795000e+05
max    4.050000e+06     4.050000e+06
Null value count in AMT_CREDIT :  0
Null value count in AMT_GOODS_PRICE :  278
# check for the distribution of the 2 variables
plot_distribution(appl_train_refined,'AMT_CREDIT')
plot_distribution(appl_train_refined,'AMT_GOODS_PRICE')
I have decided to take the ratio of AMT_CREDIT to AMT_GOODS_PRICE as the merged variable
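As a quick illustration of what this ratio captures, using the credit and goods-price amounts from the first rows shown earlier: a ratio above 1 means the credit granted exceeds the price of the goods being financed.

```python
# AMT_CREDIT and AMT_GOODS_PRICE values from the first and third rows shown above
credit = [406597.5, 135000.0]
goods_price = [351000.0, 135000.0]

# The merged feature: ratio of credit to goods price, rounded to 2 decimals
ratios = [round(c / g, 2) for c, g in zip(credit, goods_price)]
print(ratios)
```

The first loan was granted for roughly 16% more than the goods price, while the second exactly covers it, so the ratio preserves a meaningful signal while removing the collinearity between the two raw amounts.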
# Merging and creating the variable RATIO_AMT_CREDIT_TO_GOODS_PRICE
# Impute null values in AMT_GOODS_PRICE with its median
# (direct assignment instead of chained fillna(..., inplace=True), which raises a FutureWarning)
median_goods_price = appl_train_3['AMT_GOODS_PRICE'].median()
appl_train_3['AMT_GOODS_PRICE'] = appl_train_3['AMT_GOODS_PRICE'].fillna(median_goods_price)
# Calculate the ratio and round it to 2 decimal places
appl_train_3['RATIO_AMT_CREDIT_TO_GOODS_PRICE'] = (
    appl_train_3['AMT_CREDIT'] / appl_train_3['AMT_GOODS_PRICE']
).round(2)
appl_train_3.head(2)
| AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_EMPLOYED | FLAG_EMP_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | ... | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | NAME_CONTRACT_TYPE_encoded | CODE_GENDER_encoded | FLAG_OWN_CAR_encoded | NAME_INCOME_TYPE_encoded | NAME_EDUCATION_TYPE_encoded | NAME_FAMILY_STATUS_encoded | ORGANIZATION_TYPE_encoded | RATIO_AMT_CREDIT_TO_GOODS_PRICE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | 0.018801 | -637.0 | 1.0 | 0.0 | 1.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 7.0 | 4.0 | 3.0 | 5.0 | 1.16 |
| 1 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | 0.003541 | -1188.0 | 1.0 | 0.0 | 2.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 1.0 | 1.0 | 39.0 | 1.15 |
2 rows × 31 columns
Now let's take care of the other two groups of features with collinearity problems.
# dropping columns FLOORSMAX_MODE, FLOORSMAX_MEDI and keeping only FLOORSMAX_AVG
# dropping column REGION_RATING_CLIENT
# dropping columns AMT_CREDIT and AMT_GOODS_PRICE
appl_train_4 = appl_train_3.drop(columns = ['AMT_CREDIT','AMT_GOODS_PRICE','FLOORSMAX_MODE','FLOORSMAX_MEDI','REGION_RATING_CLIENT'])
appl_train_4.head(2)
| AMT_INCOME_TOTAL | AMT_ANNUITY | REGION_POPULATION_RELATIVE | DAYS_EMPLOYED | FLAG_EMP_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_WORK_REGION | ... | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | NAME_CONTRACT_TYPE_encoded | CODE_GENDER_encoded | FLAG_OWN_CAR_encoded | NAME_INCOME_TYPE_encoded | NAME_EDUCATION_TYPE_encoded | NAME_FAMILY_STATUS_encoded | ORGANIZATION_TYPE_encoded | RATIO_AMT_CREDIT_TO_GOODS_PRICE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 202500.0 | 24700.5 | 0.018801 | -637.0 | 1.0 | 0.0 | 1.0 | 2.0 | 10.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 7.0 | 4.0 | 3.0 | 5.0 | 1.16 |
| 1 | 270000.0 | 35698.5 | 0.003541 | -1188.0 | 1.0 | 0.0 | 2.0 | 1.0 | 11.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 1.0 | 1.0 | 39.0 | 1.15 |
2 rows × 26 columns
4.6 Feature Scaling and Normalization for Imputation Model¶
Now that we have decided on our final set of features, let's scale the AMT_INCOME_TOTAL column.
mmscaler = MinMaxScaler()
# Scale the specified columns
appl_train_4[['AMT_INCOME_TOTAL']] = mmscaler.fit_transform(
appl_train_4[['AMT_INCOME_TOTAL']]
)
appl_train_4.head(2)
| AMT_INCOME_TOTAL | AMT_ANNUITY | REGION_POPULATION_RELATIVE | DAYS_EMPLOYED | FLAG_EMP_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_WORK_REGION | ... | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | NAME_CONTRACT_TYPE_encoded | CODE_GENDER_encoded | FLAG_OWN_CAR_encoded | NAME_INCOME_TYPE_encoded | NAME_EDUCATION_TYPE_encoded | NAME_FAMILY_STATUS_encoded | ORGANIZATION_TYPE_encoded | RATIO_AMT_CREDIT_TO_GOODS_PRICE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001512 | 24700.5 | 0.018801 | -637.0 | 1.0 | 0.0 | 1.0 | 2.0 | 10.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 7.0 | 4.0 | 3.0 | 5.0 | 1.16 |
| 1 | 0.002089 | 35698.5 | 0.003541 | -1188.0 | 1.0 | 0.0 | 2.0 | 1.0 | 11.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 1.0 | 1.0 | 39.0 | 1.15 |
2 rows × 26 columns
# dealing with DAYS_EMPLOYED
# Make DAYS_EMPLOYED positive
appl_train_4['DAYS_EMPLOYED'] = appl_train_4['DAYS_EMPLOYED'].abs()
appl_train_4['DAYS_LAST_PHONE_CHANGE'] = appl_train_4['DAYS_LAST_PHONE_CHANGE'].abs()
# apply the MinMax scaler
appl_train_4['DAYS_EMPLOYED'] = mmscaler.fit_transform(appl_train_4[['DAYS_EMPLOYED']])
appl_train_4['DAYS_LAST_PHONE_CHANGE'] = mmscaler.fit_transform(appl_train_4[['DAYS_LAST_PHONE_CHANGE']])
appl_train_4.head(2)
| AMT_INCOME_TOTAL | AMT_ANNUITY | REGION_POPULATION_RELATIVE | DAYS_EMPLOYED | FLAG_EMP_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_WORK_REGION | ... | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | NAME_CONTRACT_TYPE_encoded | CODE_GENDER_encoded | FLAG_OWN_CAR_encoded | NAME_INCOME_TYPE_encoded | NAME_EDUCATION_TYPE_encoded | NAME_FAMILY_STATUS_encoded | ORGANIZATION_TYPE_encoded | RATIO_AMT_CREDIT_TO_GOODS_PRICE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001512 | 24700.5 | 0.018801 | 0.001744 | 1.0 | 0.0 | 1.0 | 2.0 | 10.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 7.0 | 4.0 | 3.0 | 5.0 | 1.16 |
| 1 | 0.002089 | 35698.5 | 0.003541 | 0.003253 | 1.0 | 0.0 | 2.0 | 1.0 | 11.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 1.0 | 1.0 | 39.0 | 1.15 |
2 rows × 26 columns
# Separating the rows where AMT_ANNUITY is null into their own df (to impute later)
# and dropping them from the training frame
final_rows_for_imputation = appl_train_4[appl_train_4['AMT_ANNUITY'].isnull()]
appl_train_4 = appl_train_4.dropna(subset=['AMT_ANNUITY'])
final_rows_for_imputation
| AMT_INCOME_TOTAL | AMT_ANNUITY | REGION_POPULATION_RELATIVE | DAYS_EMPLOYED | FLAG_EMP_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_WORK_REGION | ... | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | NAME_CONTRACT_TYPE_encoded | CODE_GENDER_encoded | FLAG_OWN_CAR_encoded | NAME_INCOME_TYPE_encoded | NAME_EDUCATION_TYPE_encoded | NAME_FAMILY_STATUS_encoded | ORGANIZATION_TYPE_encoded | RATIO_AMT_CREDIT_TO_GOODS_PRICE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47531 | 0.001320 | NaN | 0.026392 | 0.006908 | 1.0 | 0.0 | 1.0 | 2.0 | 13.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 2.0 | 3.0 | 5.0 | 1.00 |
| 50035 | 0.000589 | NaN | 0.035792 | 0.003477 | 1.0 | 0.0 | 2.0 | 2.0 | 20.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 3.0 | 0.0 | 3.0 | 1.00 |
| 51594 | 0.001512 | NaN | 0.046220 | 0.010815 | 1.0 | 0.0 | 2.0 | 1.0 | 15.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 4.0 | 1.0 | 42.0 | 1.20 |
| 55025 | 0.001166 | NaN | 0.035792 | 0.005585 | 1.0 | 0.0 | 2.0 | 2.0 | 13.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 1.0 | 1.0 | 11.0 | 1.32 |
| 59934 | 0.001512 | NaN | 0.046220 | 0.006839 | 1.0 | 0.0 | 1.0 | 1.0 | 15.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 7.0 | 4.0 | 3.0 | 33.0 | 1.00 |
| 75873 | 0.001012 | NaN | 0.022800 | 0.006708 | 1.0 | 0.0 | 1.0 | 2.0 | 15.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 7.0 | 1.0 | 3.0 | 5.0 | 1.32 |
| 89343 | 0.000550 | NaN | 0.015221 | 0.010188 | 1.0 | 0.0 | 2.0 | 2.0 | 11.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 4.0 | 1.0 | 30.0 | 1.00 |
| 123872 | 0.001512 | NaN | 0.019689 | 0.009692 | 1.0 | 1.0 | 2.0 | 2.0 | 12.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 4.0 | 0.0 | 42.0 | 1.29 |
| 207186 | 0.001243 | NaN | 0.018634 | 0.001292 | 1.0 | 0.0 | 2.0 | 2.0 | 13.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 40.0 | 1.00 |
| 227939 | 0.002474 | NaN | 0.046220 | 0.004047 | 1.0 | 0.0 | 2.0 | 1.0 | 11.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 5.0 | 1.26 |
| 239329 | 0.001127 | NaN | 0.026392 | 0.007899 | 1.0 | 0.0 | 2.0 | 2.0 | 14.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 4.0 | 1.0 | 33.0 | 1.00 |
| 241835 | 0.002474 | NaN | 0.072508 | 0.000416 | 1.0 | 0.0 | 2.0 | 1.0 | 15.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 7.0 | 1.0 | 1.0 | 42.0 | 1.10 |
12 rows × 26 columns
# converting the dataframe to numpy arrays
X = np.array(appl_train_4.drop('AMT_ANNUITY', axis=1))
y = np.array(appl_train_4['AMT_ANNUITY'])
print(X[0:3,:])
print(X.shape)
[[1.51186991e-03 1.88010000e-02 1.74404438e-03 1.00000000e+00
0.00000000e+00 1.00000000e+00 2.00000000e+00 1.00000000e+01
0.00000000e+00 0.00000000e+00 2.62948593e-01 8.33000000e-02
1.49000000e-02 2.64212488e-01 1.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
7.00000000e+00 4.00000000e+00 3.00000000e+00 5.00000000e+00
1.16000000e+00]
[2.08891949e-03 3.54100000e-03 3.25262907e-03 1.00000000e+00
0.00000000e+00 2.00000000e+00 1.00000000e+00 1.10000000e+01
0.00000000e+00 0.00000000e+00 6.22245775e-01 2.91700000e-01
7.14000000e-02 1.92917055e-01 1.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
4.00000000e+00 1.00000000e+00 1.00000000e+00 3.90000000e+01
1.15000000e+00]
[3.57770742e-04 1.00320000e-02 6.16028233e-04 1.00000000e+00
0.00000000e+00 1.00000000e+00 2.00000000e+00 9.00000000e+00
0.00000000e+00 0.00000000e+00 5.55912083e-01 nan
nan 1.89888164e-01 0.00000000e+00 0.00000000e+00
0.00000000e+00 1.00000000e+00 1.00000000e+00 1.00000000e+00
7.00000000e+00 4.00000000e+00 3.00000000e+00 1.10000000e+01
1.00000000e+00]]
(307499, 25)
print(y[0:3,])
print(y.shape)
[24700.5 35698.5 6750. ] (307499,)
# Train-test split (e.g., 80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=34)
print("Training Features:\n", X_train.shape)
print("Training Target:\n", y_train.shape)
print("Testing Features:\n", X_test.shape)
print("Testing Target:\n", y_test.shape)
Training Features: (245999, 25) Training Target: (245999,) Testing Features: (61500, 25) Testing Target: (61500,)
4.7 LightGBM Model Training with 5-Fold Cross-Validation¶
I plan to train the model in the following way:
- As shown above, 20% of the data has already been set aside for testing.
- The remaining 80% (the training data) will undergo 5-fold cross-validation.
- From cross-validation we obtain the best RMSE (the evaluation metric for this model).
- Finally, we will fit the model on the whole training data and make the final predictions.
# Create LightGBM dataset
train_data = lgbm.Dataset(X_train, label=y_train)
test_data = lgbm.Dataset(X_test, label=y_test, reference=train_data)
# Set parameters for the LightGBM model
params = {
'objective': 'regression',
'metric': 'rmse',
'boosting_type': 'gbdt',
'learning_rate': 0.03,
'num_leaves': 30,
'verbose': 1
}
# Perform K-Fold Cross-Validation
cv_results = lgbm.cv(
params,
train_data,
num_boost_round=1000, # Number of boosting rounds
nfold=5, # Number of folds
metrics='rmse', # Evaluation metric
stratified=False
)
cv_score = min(cv_results['valid rmse-mean'])
# Print the CV score
print("CV RMSE Score: ", cv_score)
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.130169 seconds. You can set `force_row_wise=true` to remove the overhead. And if memory is not enough, you can set `force_col_wise=true`. [LightGBM] [Info] Total Bins 1654 [LightGBM] [Info] Number of data points in the train set: 196796, number of used features: 25 [LightGBM] [Info] Start training from score 27104.863496 (the same log block repeats for each of the 5 folds) CV RMSE Score: 10107.937051493398
# Print cross-validation results
print("Best RMSE: ", min(cv_results['valid rmse-mean']))
Best RMSE: 10107.937051493398
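The text above mentions reusing the optimal number of boosting rounds found by cross-validation. One way to recover it, sketched here on a toy RMSE curve standing in for `cv_results['valid rmse-mean']` (an assumption, not the notebook's exact code), is to take the round with the lowest mean validation RMSE:

```python
import numpy as np

def optimal_rounds(valid_rmse_means):
    """Return the 1-based boosting round with the lowest mean CV RMSE."""
    return int(np.argmin(valid_rmse_means)) + 1

# Toy RMSE curve: the error bottoms out at round 3, then the folds start to overfit
toy_curve = [120.0, 110.0, 105.0, 106.0, 108.0]
best_rounds = optimal_rounds(toy_curve)
print(best_rounds)  # -> 3
```

In the notebook, the result of `optimal_rounds(cv_results['valid rmse-mean'])` could then be passed as `num_boost_round` to `lgbm.train` instead of the fixed 1000.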
# Train the final model on the full training data, with the same parameters and number of boosting rounds used in the 5-fold CV
lgbm_model = lgbm.train(
params,
train_data,
num_boost_round=1000,
valid_sets=[test_data]
)
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.053520 seconds. You can set `force_row_wise=true` to remove the overhead. And if memory is not enough, you can set `force_col_wise=true`. [LightGBM] [Info] Total Bins 1654 [LightGBM] [Info] Number of data points in the train set: 245999, number of used features: 25 [LightGBM] [Info] Start training from score 27120.115110
# Predict on the test set
y_pred = lgbm_model.predict(X_test, num_iteration=lgbm_model.best_iteration)
# Evaluate the model (sklearn >= 1.4 prefers root_mean_squared_error over the deprecated squared=False flag)
from sklearn.metrics import root_mean_squared_error
rmse = root_mean_squared_error(y_test, y_pred)
print("Test RMSE: ", rmse)
Test RMSE: 10117.617427842439
# Convert y_test and y_pred to pandas Series
y_test_series = pd.Series(y_test).reset_index(drop=True)
y_pred_series = pd.Series(y_pred)
# Line Plot for Actual vs Predicted
plt.figure(figsize=(10, 6))
plt.plot(y_test_series, label='Actual Values', color='blue', linewidth=2)
plt.plot(y_pred_series, label='Predicted Values', color='orange', linestyle='--', linewidth=2)
plt.title('Actual vs Predicted Values')
plt.xlabel('Sample Index')
plt.ylabel('Values')
plt.legend()
plt.grid()
plt.show()
# Now using the above model to impute the 12 missing values
X_missing = np.array(final_rows_for_imputation.drop('AMT_ANNUITY', axis=1))
y_missing = np.array(final_rows_for_imputation['AMT_ANNUITY'])
# predict missing values
y_pred_missing = lgbm_model.predict(X_missing, num_iteration=lgbm_model.best_iteration)
y_pred_missing
array([25772.46449965, 21599.84301145, 29199.5341337 , 26290.14381451,
30433.70880488, 22291.982206 , 22188.38300024, 35438.11132234,
27516.3155974 , 29800.85762735, 29278.1063208 , 55642.9651374 ])
Now we can use these imputed values when training the main model for credit default prediction.
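Writing the predictions back into the main frame can be done by assigning to the rows where AMT_ANNUITY is missing. This is a sketch on toy data; the names mirror the notebook (`appl_train`, `y_pred_missing`) but the values here are made up:

```python
import numpy as np
import pandas as pd

# Toy stand-in for appl_train with two missing annuities
appl_train_toy = pd.DataFrame({"AMT_ANNUITY": [24700.5, np.nan, 6750.0, np.nan]})
# Toy stand-in for the LightGBM predictions on the missing rows
y_pred_missing_toy = np.array([25772.46, 21599.84])

# Locate the missing rows and fill them, in order, with the predictions
missing_idx = appl_train_toy.index[appl_train_toy["AMT_ANNUITY"].isna()]
appl_train_toy.loc[missing_idx, "AMT_ANNUITY"] = y_pred_missing_toy
print(appl_train_toy["AMT_ANNUITY"].isna().sum())  # -> 0
```

This assumes the imputation frame preserves the original row order of the missing entries, as it does in the notebook.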
4.8 Imputation of Other Variables by a Different Method¶
Here, instead of building imputation models, we opt for an analytical rule-based approach: we impute each missing column using the most appropriate central-tendency metric (mean, median, or mode).
## Imputation functions, from start to end
# Check Missing Values
def check_missing_values_p(df: pd.DataFrame, column_name: str) -> int:
"""
Function to count missing values (NaN) in a specified column in a Pandas DataFrame.
Parameters:
df : pd.DataFrame
The input Pandas DataFrame.
column_name : str
The name of the column to check for missing values.
Returns:
int: Number of missing values in the column.
"""
missing_values = df[column_name].isna().sum()
print(f"Missing values in {column_name}: {missing_values}")
return missing_values
# Check Distribution of Column
def check_distribution_p(df: pd.DataFrame, column_name: str) -> dict:
"""
Function to display basic statistics (mean, stddev, min, max) of a specified column in a Pandas DataFrame.
Parameters:
df : pd.DataFrame
The input Pandas DataFrame.
column_name : str
The name of the column to check the distribution.
Returns:
dict: A dictionary containing count, mean, stddev, min, and max values.
"""
summary = df[column_name].describe()
summary_dict = {
'count': summary['count'],
'mean': summary['mean'],
'stddev': summary['std'] if 'std' in summary.index else None,
'min': summary['min'],
'max': summary['max']
}
print(summary_dict)
return summary_dict
# Choose Imputation Method
def choose_imputation_method_p(count, mean, stddev, min_val, max_val) -> str:
"""
Function to suggest the best imputation method (mean, median, or mode)
based on the summary statistics of a column.
Parameters:
count (int): Total count of non-null values in the column.
mean (float): Mean of the column values.
stddev (float): Standard deviation of the column values.
min_val (float): Minimum value in the column.
max_val (float): Maximum value in the column.
Returns:
str: Recommended imputation method ('mean', 'median', or 'mode').
"""
# If there's a small standard deviation, mean imputation is okay
if stddev <= mean:
return "mean"
# If standard deviation is high, there's likely more variation or outliers
if stddev > mean:
# Mode is useful if the minimum value is 0 and data is skewed (potentially binary or categorical)
if min_val == 0 and max_val / mean > 5:
return "mode"
else:
return "median"
# Default to median if no strong preference can be inferred
return "median"
# Impute Missing Values with Mean, Median, Mode
def impute_method_p(df: pd.DataFrame, column_name: str, method: str) -> pd.DataFrame:
"""
Function to impute missing values in a column with the specified method (mean, median, mode).
Parameters:
df : pd.DataFrame
The input Pandas DataFrame.
column_name : str
The name of the column to impute missing values for.
method : str
The imputation method ('mean', 'median', 'mode').
Returns:
pd.DataFrame: DataFrame with missing values imputed.
"""
    # Plain assignment (rather than fillna(..., inplace=True) on a column view)
    # avoids pandas' chained-assignment FutureWarning
    if method == 'mean':
        mean_value = df[column_name].mean()
        print(f"Mean for {column_name}: {mean_value}")
        df[column_name] = df[column_name].fillna(mean_value)
    elif method == 'median':
        median_value = df[column_name].median()
        print(f"Median for {column_name}: {median_value}")
        df[column_name] = df[column_name].fillna(median_value)
    elif method == 'mode':
        mode_value = df[column_name].mode()[0]
        print(f"Mode for {column_name}: {mode_value}")
        df[column_name] = df[column_name].fillna(mode_value)
    else:
        print("Invalid method")
return df
# Verify Imputation
def verify_imputation_p(df: pd.DataFrame, column_name: str) -> int:
"""
Function to verify if all missing values in a column have been imputed.
Parameters:
df : pd.DataFrame
The input Pandas DataFrame.
column_name : str
The name of the column to verify imputation for.
Returns:
int: Number of remaining missing values.
"""
missing_values = df[column_name].isna().sum()
print(f"Remaining missing values in {column_name}: {missing_values}")
return missing_values
column='DEF_30_CNT_SOCIAL_CIRCLE'
missing_count = check_missing_values_p(appl_train, column)
Missing values in DEF_30_CNT_SOCIAL_CIRCLE: 1021
summary_stats = check_distribution_p(appl_train,column)
{'count': 306490.0, 'mean': 0.1434206662533851, 'stddev': 0.4466984293825373, 'min': 0.0, 'max': 34.0}
# Choose imputation method
method = choose_imputation_method_p(
count=summary_stats['count'],
mean=summary_stats['mean'],
stddev=summary_stats['stddev'],
min_val=summary_stats['min'],
max_val=summary_stats['max']
)
print(method)
mode
appl_train = impute_method_p(appl_train, column, method)
Mode for DEF_30_CNT_SOCIAL_CIRCLE: 0.0
verify_imputation_p(appl_train, column)
Remaining missing values in DEF_30_CNT_SOCIAL_CIRCLE: 0
0
summary_stats = check_distribution_p(appl_train, column)
{'count': 307511.0, 'mean': 0.14294448003486054, 'stddev': 0.4460325557437661, 'min': 0.0, 'max': 34.0}
5. Exploratory Data Analysis (for Application_Train file after cleaning)¶
There are 10 files in our library: 7 of them are data sources, and the remaining 3 are the train, test, and sample submission files. We'll check on the main file first (application_train.csv) and see what we can make out of it, then use the test set (application_test.csv) to make our submission.
appl_train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB
appl_train.shape
(307511, 122)
For the training file, we have a total of 307511 observations and 122 features to consider -- with integer, float, and object datatypes, and 67 features containing null values.
The test file is similar: 48744 observations, 121 features (it lacks the predictor variable 'TARGET'), and 64 features containing null values.
Note on the words being used:
- observations == samples == rows
- features == fields == columns
- defaulted == failed to meet the legal obligations of the loan
# Print first 5 rows of the training file
appl_train.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
Since our model's task is to classify the 'TARGET' column, we can plot it to visualize the current class distribution.
# Show target distribution
sns.set_style('darkgrid')
print(appl_train.TARGET.value_counts())
appl_train.TARGET.plot.hist(color='mediumseagreen').set_xlabel('Target value: 0 or 1');
TARGET 0 282686 1 24825 Name: count, dtype: int64
Based on the above information, the dataset is imbalanced: only around 8% of the loans in the training set were not repaid.
To be clear, the indicator of whether a current/previous customer defaulted on a loan is the 'TARGET' column:
TARGET == 0 --> individuals who repaid their loan
TARGET == 1 --> individuals who did NOT repay their loan
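The roughly 8% default share noted above can be reproduced with `value_counts(normalize=True)`. A minimal sketch on a toy TARGET series (the notebook would call this on `appl_train.TARGET`):

```python
import pandas as pd

# Toy stand-in for appl_train.TARGET: 92 repaid (0), 8 defaulted (1)
target = pd.Series([0] * 92 + [1] * 8)
rates = target.value_counts(normalize=True)
print(rates[1])  # -> 0.08, the defaulted share
```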
Now, let's take a closer look at the fields.
Analyzing Data Insights (EDA)¶
We'll be taking note of the things below during our field inspection:
- Completing - any missing values to fill in?
- Correlation - which features contribute significantly to our solution goal?
- Correcting - any outlier that skews our data terribly? Do we have some unusual error seen on the dataset? We may need to correct or exclude inaccurate features.
- Conversion - most of the time, the text features need to be converted to numerical values for our model training.
- Creating - maybe we can create a new feature out of the existing set of highly correlated features?
Before anything else, it's important to understand why these fields are recorded in the first place. Below is my understanding of how to judge whether an applicant is a good candidate for a loan.

(I'm not claiming that this is right, but for me, this makes sense.)
During field inspection, I'll try to determine which of these questions a given field addresses; this serves as a guide for how we deal with each feature. Below are Excel-sheet snippets of the initial sorting I made on all the fields (please see the TAG column).
Decisions:
- 'SK_ID_CURR' is just the loan ID which is unique for every individual. It will not contribute to the prediction algorithm. --> DROP
- The 'TARGET' variable is the one that we are trying to predict, so we will NOT use it as a feature.
# COLUMNS TO BE DROPPED
list_col_drop = ['SK_ID_CURR']
tempY=appl_train[appl_train.FLAG_OWN_CAR=='Y']
tempN=appl_train[appl_train.FLAG_OWN_CAR=='N']
tempY_targ1=tempY[tempY.TARGET==1]
tempN_targ1=tempN[tempN.TARGET==1]
print('People count who owns a car:',len(tempY),'(',round(len(tempY)/len(appl_train.index)*100,2),'%)')
print('People count who DOES NOT own a car:',len(tempN),'(',round(len(tempN)/len(appl_train.index)*100,2),'%)')
print('Percentage of people who defaulted (with cars):',round(len(tempY_targ1)/len(tempY)*100,2),'%')
print('Percentage of people who defaulted (no cars):',round(len(tempN_targ1)/len(tempN)*100,2),'%')
People count who owns a car: 104587 ( 34.01 %) People count who DOES NOT own a car: 202924 ( 65.99 %) Percentage of people who defaulted (with cars): 7.24 % Percentage of people who defaulted (no cars): 8.5 %
tempY=appl_train[appl_train.FLAG_OWN_REALTY=='Y']
tempN=appl_train[appl_train.FLAG_OWN_REALTY=='N']
tempY_targ1=tempY[tempY.TARGET==1]
tempN_targ1=tempN[tempN.TARGET==1]
Observations:
- It's a small difference, but people with no car and/or realty tend to default slightly more than those who own one.
Decisions:
- I'll be converting these 2 categorical fields into one nominal field called 'assets'.
# CREATE NEW COLUMN : 0 - none, 1 - with car no realty, 2 - no car with realty, 3 - with car with realty
list_col_new_asset = ['FLAG_OWN_CAR','FLAG_OWN_REALTY']
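The proposed 'assets' encoding (0 - none, 1 - car only, 2 - realty only, 3 - both) could be sketched as below. The column names are from the dataset; the toy values and the column name 'ASSETS' are illustrative assumptions:

```python
import pandas as pd

# Toy rows covering all four car/realty combinations
toy = pd.DataFrame({"FLAG_OWN_CAR":    ["Y", "N", "N", "Y"],
                    "FLAG_OWN_REALTY": ["N", "Y", "N", "Y"]})

# Car contributes 1, realty contributes 2, giving the 0-3 nominal code
toy["ASSETS"] = (toy["FLAG_OWN_CAR"].eq("Y").astype(int)
                 + 2 * toy["FLAG_OWN_REALTY"].eq("Y").astype(int))
print(toy["ASSETS"].tolist())  # -> [1, 2, 0, 3]
```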
The expenditure-related set is quite large. We can break it down based on how the fields relate to one another.
Expenditure-related: THE LOAN ITSELF.
sns.pairplot(appl_train[['NAME_CONTRACT_TYPE','AMT_CREDIT','AMT_ANNUITY','AMT_GOODS_PRICE']],hue='NAME_CONTRACT_TYPE');
corr1=round(appl_train.AMT_CREDIT.corr(appl_train.AMT_GOODS_PRICE),2)
corr2=round(appl_train.AMT_ANNUITY.corr(appl_train.AMT_CREDIT),2)
corr3=round(appl_train.AMT_ANNUITY.corr(appl_train.AMT_GOODS_PRICE),2)
print('Correlation of Credit amount vs Price of goods:',corr1)
print('Correlation of Annuity amount vs Credit amount:',corr2)
print('Correlation of Annuity amount vs Price of goods:',corr3)
Correlation of Credit amount vs Price of goods: 0.99 Correlation of Annuity amount vs Credit amount: 0.77 Correlation of Annuity amount vs Price of goods: 0.78
cash=appl_train[appl_train.NAME_CONTRACT_TYPE == 'Cash loans']
rev=appl_train[appl_train.NAME_CONTRACT_TYPE == 'Revolving loans']
def_cash=cash[cash.TARGET==1]
def_rev=rev[rev.TARGET==1]
print('Percentage of defaulted cash loan:',round(len(def_cash)/len(cash)*100,2,),'%')
print('Percentage of defaulted revolving loan:',round(len(def_rev)/len(rev)*100,2),'%')
sns.catplot(data=appl_train,x='NAME_CONTRACT_TYPE',hue='TARGET',kind='count');
Percentage of defaulted cash loan: 8.35 % Percentage of defaulted revolving loan: 5.48 %
Observations:
- AMT_CREDIT and AMT_GOODS_PRICE are highly correlated (scoring 0.99!) with a positive linear relationship -- which makes sense: as the price of the goods for which the loan is given rises, the credit amount of the loan naturally rises too.
- AMT_ANNUITY is also highly positively correlated with AMT_CREDIT and AMT_GOODS_PRICE. This is because the annuity is the monthly amount due on the loan.
- NAME_CONTRACT_TYPE: the count of defaulters is much larger for cash loans than for revolving loans; however, we must note that cash loans are also significantly more popular among our sample consumers.
Decisions:
- AMT_GOODS_PRICE - to drop as it is quite redundant. We already have the AMT_CREDIT field to account for the loan amount.
- AMT_ANNUITY - create new feature PERCENT_ANNUITY_INCOME to indicate the loan annuity amount relative to the person's total income.
- AMT_CREDIT - create new feature PERCENT_CREDIT_INCOME to indicate the loan credit amount relative to the person's total income.
- NAME_CONTRACT_TYPE - to drop, since the default rates for the two loan types differ by less than 3%. We can focus instead on the person's annuity or credit amount relative to their income.
# COLUMNS TO BE DROPPED
list_col_drop.extend(['NAME_CONTRACT_TYPE','AMT_GOODS_PRICE'])
# CREATE NEW COLUMN : PERCENT_ANNUITY_INCOME
list_col_new_annuity = ['AMT_ANNUITY','AMT_INCOME_TOTAL']
# CREATE NEW COLUMN : PERCENT_CREDIT_INCOME
list_col_new_credit = ['AMT_CREDIT','AMT_INCOME_TOTAL']
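The two ratio features decided on above can be sketched as simple column divisions. Toy values here; the notebook would apply this to `appl_train` after imputation:

```python
import pandas as pd

# Toy income/annuity/credit values mirroring the dataset's columns
toy = pd.DataFrame({"AMT_INCOME_TOTAL": [202500.0, 270000.0],
                    "AMT_ANNUITY":      [24700.5, 35698.5],
                    "AMT_CREDIT":       [406597.5, 1293502.5]})

# Annuity and credit relative to the person's total income
toy["PERCENT_ANNUITY_INCOME"] = toy["AMT_ANNUITY"] / toy["AMT_INCOME_TOTAL"]
toy["PERCENT_CREDIT_INCOME"] = toy["AMT_CREDIT"] / toy["AMT_INCOME_TOTAL"]
print(toy["PERCENT_ANNUITY_INCOME"].round(3).tolist())  # -> [0.122, 0.132]
```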
Expenditure-related: SIZE OF HOUSEHOLD
sns.catplot(data=appl_train,x='NAME_FAMILY_STATUS',hue='TARGET',kind='count');
plt.xticks(rotation=90);
appl_train[['CNT_CHILDREN','NAME_FAMILY_STATUS','CNT_FAM_MEMBERS']][appl_train.NAME_FAMILY_STATUS=='Married'].tail()
| CNT_CHILDREN | NAME_FAMILY_STATUS | CNT_FAM_MEMBERS | |
|---|---|---|---|
| 307500 | 0 | Married | 2.0 |
| 307501 | 3 | Married | 5.0 |
| 307502 | 1 | Married | 3.0 |
| 307509 | 0 | Married | 2.0 |
| 307510 | 0 | Married | 2.0 |
Observations:
- We have a large number of married customers in our sample population. The married set also contains the most frequent defaulting individuals.
- 'CNT_CHILDREN', 'NAME_FAMILY_STATUS', 'CNT_FAM_MEMBERS' are all related to family size. And if we only want to consider the family size for approximation of expenditures, we already have 'CNT_FAM_MEMBERS' wherein it accounts for the customer itself, plus spouse (if any), plus number of children (if any).
Decisions:
- Use 'CNT_FAM_MEMBERS' as a model feature. Fill in its missing values.
- Drop 'CNT_CHILDREN' and 'NAME_FAMILY_STATUS'.
# COLUMNS TO BE DROPPED
list_col_drop.extend(['CNT_CHILDREN', 'NAME_FAMILY_STATUS'])
# COMPLETE COLUMN :
list_col_fill_fam = ['CNT_FAM_MEMBERS']
Expenditure-related: TRAVEL
corr1=round(appl_train.OWN_CAR_AGE.corr(appl_train.TARGET),2)
corr2=round(appl_train.REG_REGION_NOT_WORK_REGION.corr(appl_train.TARGET),2)
corr3=round(appl_train.REG_CITY_NOT_WORK_CITY.corr(appl_train.TARGET),2)
no_car,yes_car = appl_train.FLAG_OWN_CAR.value_counts()
print('Correlation of Age of Car vs Target:',corr1)
print('Correlation of Registered Region aint Work Region vs Target:',corr2)
print('Correlation of Registered City aint Work City vs Target:',corr3)
Correlation of Age of Car vs Target: 0.04 Correlation of Registered Region aint Work Region vs Target: 0.01 Correlation of Registered City aint Work City vs Target: 0.05
appl_train[['OWN_CAR_AGE','REG_REGION_NOT_WORK_REGION','REG_CITY_NOT_WORK_CITY']].describe()
| OWN_CAR_AGE | REG_REGION_NOT_WORK_REGION | REG_CITY_NOT_WORK_CITY | |
|---|---|---|---|
| count | 104582.000000 | 307511.000000 | 307511.000000 |
| mean | 12.061091 | 0.050769 | 0.230454 |
| std | 11.944812 | 0.219526 | 0.421124 |
| min | 0.000000 | 0.000000 | 0.000000 |
| 25% | 5.000000 | 0.000000 | 0.000000 |
| 50% | 9.000000 | 0.000000 | 0.000000 |
| 75% | 15.000000 | 0.000000 | 0.000000 |
| max | 91.000000 | 1.000000 | 1.000000 |
print('How many customers own a car? :',yes_car)
print('How many customers do NOT own a car? :',no_car)
print('How many missing values on OWN_CAR_AGE? :',appl_train.OWN_CAR_AGE.isnull().sum())
How many customers own a car? : 104587 How many customers do NOT own a car? : 202924 How many missing values on OWN_CAR_AGE? : 202929
Observations:
- We have a low correlation between these travel-related fields and the target variable.
- The number of missing values in 'OWN_CAR_AGE' is almost equal to the number of customers who do NOT own a car -- which explains why the field has missing values.
Decisions:
- Drop REG_REGION_NOT_WORK_REGION and REG_CITY_NOT_WORK_CITY, as they have low correlation with the TARGET variable.
- Create new ordinal category 'Car Expenditure' from a continuous numerical feature 'OWN_CAR_AGE'. Perhaps the older the car, the higher the maintenance expenditure.
# CREATE NEW COLUMN : EXPENDITURE_CAR : 0 - no car, +=1 per age band
list_col_new_car = ['OWN_CAR_AGE']
# COLUMNS TO BE DROPPED
list_col_drop.extend(['REG_REGION_NOT_WORK_REGION','REG_CITY_NOT_WORK_CITY'])
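The 'Car Expenditure' banding described above could look like the sketch below. The 10-year bands and the treatment of NaN as "no car" are assumptions for illustration, not the notebook's final choice:

```python
import numpy as np
import pandas as pd

# Toy OWN_CAR_AGE values; NaN means the customer owns no car
own_car_age = pd.Series([np.nan, 3.0, 12.0, 25.0])

# Cut into 10-year-ish bands; NaN falls outside every band
bands = pd.cut(own_car_age, bins=[-1, 10, 20, 100], labels=[1, 2, 3])

# cat.codes maps NaN to -1, so +1 yields 0 for "no car" and 1..3 per band
expenditure_car = bands.cat.codes + 1
print(expenditure_car.tolist())  # -> [0, 1, 2, 3]
```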
Expenditure-related: RESIDENCE
temp = ['NAME_HOUSING_TYPE',
'APARTMENTS_AVG',
'BASEMENTAREA_AVG',
'YEARS_BEGINEXPLUATATION_AVG',
'YEARS_BUILD_AVG',
'COMMONAREA_AVG',
'ELEVATORS_AVG',
'ENTRANCES_AVG',
'FLOORSMAX_AVG',
'FLOORSMIN_AVG',
'LANDAREA_AVG',
'LIVINGAPARTMENTS_AVG',
'LIVINGAREA_AVG',
'NONLIVINGAPARTMENTS_AVG',
'NONLIVINGAREA_AVG',
'APARTMENTS_MODE',
'BASEMENTAREA_MODE',
'YEARS_BEGINEXPLUATATION_MODE',
'YEARS_BUILD_MODE',
'COMMONAREA_MODE',
'ELEVATORS_MODE',
'ENTRANCES_MODE',
'FLOORSMAX_MODE',
'FLOORSMIN_MODE',
'LANDAREA_MODE',
'LIVINGAPARTMENTS_MODE',
'LIVINGAREA_MODE',
'NONLIVINGAPARTMENTS_MODE',
'NONLIVINGAREA_MODE',
'APARTMENTS_MEDI',
'BASEMENTAREA_MEDI',
'YEARS_BEGINEXPLUATATION_MEDI',
'YEARS_BUILD_MEDI',
'COMMONAREA_MEDI',
'ELEVATORS_MEDI',
'ENTRANCES_MEDI',
'FLOORSMAX_MEDI',
'FLOORSMIN_MEDI',
'LANDAREA_MEDI',
'LIVINGAPARTMENTS_MEDI',
'LIVINGAREA_MEDI',
'NONLIVINGAPARTMENTS_MEDI',
'NONLIVINGAREA_MEDI',
'FONDKAPREMONT_MODE',
'HOUSETYPE_MODE',
'TOTALAREA_MODE',
'WALLSMATERIAL_MODE',
'EMERGENCYSTATE_MODE']
# Nullity by column
msno.bar(appl_train[temp],figsize=(20,5));
# Identify non-numeric columns
non_numeric_columns = appl_train.select_dtypes(include=['object']).columns
print("Categorical Columns:", non_numeric_columns)
Categorical Columns: Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'],
dtype='object')
# One-hot encode categorical columns
appl_train_eda = pd.get_dummies(appl_train, columns=non_numeric_columns, drop_first=True)  # drop_first=True removes one redundant dummy per category
# Calculate the correlation matrix
correlations = appl_train_eda.corr()
# Set a threshold for selecting strong correlations
threshold = 0.05
# Select features with correlations above the threshold with the target
best_features = correlations['TARGET'][abs(correlations['TARGET']) > threshold].index
# Create a subset of the correlation matrix with these features and the target
best_corr_matrix = correlations.loc[best_features, best_features]
# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(best_corr_matrix, annot=True, cmap='mako', vmin=-1, vmax=1, fmt=".2f", square=True)
plt.title("Heatmap of Best Correlated Features with Target Variable")
plt.show()
The heatmap above shows the pairwise correlations among the features most strongly correlated with the target (those with |correlation| > 0.05):
- 1 : perfect positive correlation >> the two variables rise and fall together.
- 0 : no linear correlation >> changes in one variable tell us nothing about the other.
- -1 : perfect negative correlation >> one variable rises exactly as the other falls.
appl_train.columns.to_list()
['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'TOTALAREA_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 
'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
for i in ['APARTMENTS_AVG','LANDAREA_AVG','LIVINGAPARTMENTS_AVG','NONLIVINGAREA_MEDI']:
    temp = appl_train[['NAME_HOUSING_TYPE']][appl_train[i].isnull()]
    # assign x to hue (with legend=False) to avoid seaborn's palette deprecation warning
    sns.catplot(data=temp, x='NAME_HOUSING_TYPE', hue='NAME_HOUSING_TYPE',
                kind='count', palette="rocket", legend=False)
    plt.xticks(rotation=20)
    plt.title('Housing Types with null values on ' + i)
Observations:
- Every unit/size-related field from 'APARTMENTS_AVG' through 'WALLSMATERIAL_MODE' is highly correlated with the others, with no pairwise value below 0.7.
- I've plotted 4 sample features of this normalized housing information, and the missing values mostly come from the 'house / apartment' type -- note that a 'rented' apartment is a separate type (with a low count of missing values). Intuitively, someone who has lived in a house for a long time usually doesn't track its floor size (unless they're going to renovate, rent out, or sell the place!).
- The housing type with the second-highest count of null values is 'with parents'; again, someone living with their parents typically won't know the land and floor area of their current residence (where I come from, this is usually the case).
Decisions:
- Since these are all residence-related size fields, I will consolidate them into one feature, 'EXPENDITURE_HOUSE', to account for housing expenditures.
# CREATE NEW COLUMN : EXPENDITURE_HOUSE
list_col_new_house = ['NAME_HOUSING_TYPE',
'APARTMENTS_AVG',
'BASEMENTAREA_AVG',
'YEARS_BEGINEXPLUATATION_AVG',
'YEARS_BUILD_AVG',
'COMMONAREA_AVG',
'ELEVATORS_AVG',
'ENTRANCES_AVG',
'FLOORSMAX_AVG',
'FLOORSMIN_AVG',
'LANDAREA_AVG',
'LIVINGAPARTMENTS_AVG',
'LIVINGAREA_AVG',
'NONLIVINGAPARTMENTS_AVG',
'NONLIVINGAREA_AVG',
'APARTMENTS_MODE',
'BASEMENTAREA_MODE',
'YEARS_BEGINEXPLUATATION_MODE',
'YEARS_BUILD_MODE',
'COMMONAREA_MODE',
'ELEVATORS_MODE',
'ENTRANCES_MODE',
'FLOORSMAX_MODE',
'FLOORSMIN_MODE',
'LANDAREA_MODE',
'LIVINGAPARTMENTS_MODE',
'LIVINGAREA_MODE',
'NONLIVINGAPARTMENTS_MODE',
'NONLIVINGAREA_MODE',
'APARTMENTS_MEDI',
'BASEMENTAREA_MEDI',
'YEARS_BEGINEXPLUATATION_MEDI',
'YEARS_BUILD_MEDI',
'COMMONAREA_MEDI',
'ELEVATORS_MEDI',
'ENTRANCES_MEDI',
'FLOORSMAX_MEDI',
'FLOORSMIN_MEDI',
'LANDAREA_MEDI',
'LIVINGAPARTMENTS_MEDI',
'LIVINGAREA_MEDI',
'NONLIVINGAPARTMENTS_MEDI',
'NONLIVINGAREA_MEDI',
'FONDKAPREMONT_MODE',
'HOUSETYPE_MODE',
'TOTALAREA_MODE',
'WALLSMATERIAL_MODE',
'EMERGENCYSTATE_MODE']
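The consolidation could be sketched as follows. The toy frame, and the choice of a row-wise mean over the size fields, are illustrative assumptions; the notebook lists ~47 columns and does not yet fix the combining rule:

```python
import pandas as pd

# Toy frame with a few of the residence-size columns (the real data has ~47)
df = pd.DataFrame({
    'APARTMENTS_AVG': [0.02, None, 0.09],
    'LANDAREA_AVG':   [0.04, 0.01, None],
    'LIVINGAREA_AVG': [0.03, 0.02, 0.08],
})

size_cols = ['APARTMENTS_AVG', 'LANDAREA_AVG', 'LIVINGAREA_AVG']
# Row-wise mean over whatever size fields are present; NaN only if all are missing
df['EXPENDITURE_HOUSE'] = df[size_cols].mean(axis=1, skipna=True)
print(df['EXPENDITURE_HOUSE'].round(3).tolist())  # [0.03, 0.015, 0.085]
```

Averaging keeps the consolidated value on the same normalized scale as the inputs and tolerates the heavy missingness noted above.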
fig,ax = plt.subplots(figsize=(16,8))
sns.countplot(data=appl_train,x='REGION_POPULATION_RELATIVE',hue='TARGET',ax=ax);
plt.xticks(rotation=90)
plt.show()
sns.catplot(data=appl_train,x='WEEKDAY_APPR_PROCESS_START',hue='TARGET',kind='count');
plt.xticks(rotation=90);
fig,ax = plt.subplots(figsize=(10,3))
sns.countplot(data=appl_train,x='HOUR_APPR_PROCESS_START',hue='TARGET',ax=ax);
plt.xticks(rotation=90)
plt.show()
Observations:
- REGION_POPULATION_RELATIVE has min value 0.00029 and max value 0.072508. Normalized values usually range from 0 to 1, but here the maximum falls far short of 1. Whatever the scaling, defaults become noticeably more frequent as the value increases.
- The majority of customers apply on weekdays, with a few on weekends. The weekly pattern for customers who failed to repay is similar to that of those who repaid.
- Suspiciously, some people apply for a loan as early as 3 am, with applications getting denser throughout the day. Note that those who defaulted show a similar hourly pattern to those with good records.
Decisions:
- Use REGION_POPULATION_RELATIVE as a model feature. The larger the population of a region, the greater the chance of encountering a customer with a bad record.
- Drop WEEKDAY_APPR_PROCESS_START. I can't claim there's little chance of defaulting if a customer opens an account on a Sunday, because overall account openings on Sundays are simply low compared to other days.
- Drop HOUR_APPR_PROCESS_START, for the same reason.
# COLUMNS TO BE DROPPED
list_col_drop.extend(['WEEKDAY_APPR_PROCESS_START','HOUR_APPR_PROCESS_START'])
temp=appl_train[['TARGET','EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3']]
print(temp.info())
sns.pairplot(temp,hue='TARGET');
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   TARGET        307511 non-null  int64
 1   EXT_SOURCE_1  134133 non-null  float64
 2   EXT_SOURCE_2  306851 non-null  float64
 3   EXT_SOURCE_3  246546 non-null  float64
dtypes: float64(3), int64(1)
memory usage: 9.4 MB
None
Observations:
- These 3 fields are external data source score fields.
- Based on the plot above, scores on the EXT_SOURCE fields are fairly spread for both those who repaid and those who didn't, but on the lower end of the normalized score range (0.0-0.5) customers who paid (target=0, blue) are clearly less prominent than those who didn't (target=1, orange), and vice versa on the upper end.
- All 3 fields have missing values.
Decisions:
- Use EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3 as model features.
- Complete these fields by filling the null values based on the mean EXT_SOURCE score.
# COMPLETE COLUMNS : Based on mean EXT_SOURCE
list_col_fill_ext = ['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3']
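One reading of "fill based on their mean EXT_SOURCE" is to impute each missing score with the mean of that applicant's other external scores. A minimal sketch under that assumption (the toy values are illustrative):

```python
import pandas as pd

ext = pd.DataFrame({
    'EXT_SOURCE_1': [0.5, None, None],
    'EXT_SOURCE_2': [0.7, 0.4, 0.6],
    'EXT_SOURCE_3': [None, 0.2, 0.6],
})

cols = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']
# Row-wise mean over the scores that are present
row_mean = ext[cols].mean(axis=1)
# fillna with a Series aligns by index, so each row gets its own mean
for c in cols:
    ext[c] = ext[c].fillna(row_mean)
print(ext['EXT_SOURCE_1'].round(2).tolist())  # [0.5, 0.3, 0.6]
```

An alternative would be the column-wise mean over all applicants; the row-wise version preserves each applicant's own score level.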
appl_train[['AMT_INCOME_TOTAL','NAME_INCOME_TYPE','DAYS_EMPLOYED','OCCUPATION_TYPE','ORGANIZATION_TYPE']].describe(include='all')
| AMT_INCOME_TOTAL | NAME_INCOME_TYPE | DAYS_EMPLOYED | OCCUPATION_TYPE | ORGANIZATION_TYPE | |
|---|---|---|---|---|---|
| count | 3.075110e+05 | 307511 | 307511.000000 | 211120 | 307511 |
| unique | NaN | 8 | NaN | 18 | 58 |
| top | NaN | Working | NaN | Laborers | Business Entity Type 3 |
| freq | NaN | 158774 | NaN | 55186 | 67992 |
| mean | 1.687979e+05 | NaN | 63815.045904 | NaN | NaN |
| std | 2.371231e+05 | NaN | 141275.766519 | NaN | NaN |
| min | 2.565000e+04 | NaN | -17912.000000 | NaN | NaN |
| 25% | 1.125000e+05 | NaN | -2760.000000 | NaN | NaN |
| 50% | 1.471500e+05 | NaN | -1213.000000 | NaN | NaN |
| 75% | 2.025000e+05 | NaN | -289.000000 | NaN | NaN |
| max | 1.170000e+08 | NaN | 365243.000000 | NaN | NaN |
# Function to compute an outlier threshold for a field
def get_thresh(df, field):
    """Outliers are usually > 3 standard deviations away from the mean."""
    ave = np.mean(df[field])
    sdev = np.std(df[field])
    threshold = round(ave + (3 * sdev), 2)
    print('Threshold for', field, ':', threshold)
    return threshold
thresh_income = get_thresh(appl_train,'AMT_INCOME_TOTAL')
anomaly_emp = int(appl_train['DAYS_EMPLOYED'][appl_train['DAYS_EMPLOYED']>0].unique()[0])  # extract the single element explicitly to avoid the NumPy scalar-conversion deprecation
temp_orig=appl_train[['AMT_INCOME_TOTAL','DAYS_EMPLOYED']]
temp_no_outliers=appl_train[['AMT_INCOME_TOTAL','DAYS_EMPLOYED']][(appl_train.AMT_INCOME_TOTAL<thresh_income)&(appl_train['DAYS_EMPLOYED']<=0)]
print('Anomalous data for DAYS_EMPLOYED :',anomaly_emp)
Threshold for AMT_INCOME_TOTAL : 880166.2
Anomalous data for DAYS_EMPLOYED : 365243
def plotdist(df, f1, f2):
    f, axes = plt.subplots(1, 2, figsize=(10, 3))
    # histplot (with kde=True) replaces the deprecated distplot
    sns.histplot(df[f1], kde=True, ax=axes[0]).set_title(f1)
    axes[0].tick_params(axis='x', rotation=75)
    sns.histplot(df[f2], kde=True, ax=axes[1]).set_title(f2)
    axes[1].tick_params(axis='x', rotation=75)
    plt.tight_layout()
# AMT_INCOME_TOTAL, DAYS_EMPLOYED --> ORIGINAL VALUE WITH OUTLIERS
plotdist(temp_orig,'AMT_INCOME_TOTAL','DAYS_EMPLOYED')
# AMT_INCOME_TOTAL, DAYS_EMPLOYED --> OUTLIERS REMOVED
plotdist(temp_no_outliers,'AMT_INCOME_TOTAL','DAYS_EMPLOYED')
# NUMBER OF INDIVIDUALS HAVING THE DAYS EMPLOYED ANOMALOUS DATA
len(appl_train[appl_train.DAYS_EMPLOYED==anomaly_emp])
55374
unpaid=appl_train[appl_train.TARGET==1]
sns.catplot(data=unpaid,x='NAME_INCOME_TYPE',kind='count');
plt.xticks(rotation=90);
plt.title('Income Stream of DEFAULTED LOANS');
fig,ax = plt.subplots(figsize=(15,5))
sns.countplot(data=unpaid,x='ORGANIZATION_TYPE',ax=ax);
plt.xticks(rotation=90)
plt.title('Organization of people with DEFAULTED LOANS')
plt.show()
def plotstrip(df, xval, yval, hueval, yfig):
    fig, ax = plt.subplots(figsize=(15, yfig))
    sns.stripplot(x=xval, y=yval, hue=hueval, data=df, alpha=0.5, jitter=0.8, dodge=True, ax=ax).set_title(yval);
    plt.legend(bbox_to_anchor=(1.05, 1))
    plt.show()
temp=appl_train[appl_train.OCCUPATION_TYPE.isnull()]
print('Individuals that left OCCUPATION_TYPE field blank:')
print(temp.NAME_INCOME_TYPE.value_counts())
Individuals that left OCCUPATION_TYPE field blank:
NAME_INCOME_TYPE
Pensioner               55357
Working                 24920
Commercial associate    12297
State servant            3787
Unemployed                 22
Student                     5
Businessman                 2
Maternity leave             1
Name: count, dtype: int64
plotstrip(appl_train[appl_train.AMT_INCOME_TOTAL<thresh_income],'AMT_INCOME_TOTAL','NAME_INCOME_TYPE','TARGET',5)
plotstrip(appl_train[appl_train.AMT_INCOME_TOTAL<thresh_income],'AMT_INCOME_TOTAL','ORGANIZATION_TYPE','TARGET',20)
Observations:
- AMT_INCOME_TOTAL has outliers. The 75th percentile is 202,500, while the maximum is a whopping 117,000,000. In other words, 75% of our customers have a total income of 202,500 or below; the remaining 25% earn more than that.
- AMT_INCOME_TOTAL spans a wide range, from a minimum of 25,650 to a maximum of 117,000,000.
- DAYS_EMPLOYED contains inconsistent data. Most values are negative (days relative to the application date), yet there is one very large positive value (365,243 days, which equates to about 1,000 years!) shared by a significant number of individuals.
- NAME_INCOME_TYPE: the 'Working' category is the densest in terms of low-wage, high-default customers. We have very few samples for 'Unemployed', 'Student', 'Maternity leave', and 'Businessman'. Interestingly, the 'Businessman' category has an above-average total income and a high chance of maintaining a good credit record.
- The 'Pensioner' count is almost equal to the count of the DAYS_EMPLOYED anomaly, and also nearly equal to the count of the ORGANIZATION_TYPE value 'XNA'. This suggests the '1,000-year' employment duration was recorded for retired people.
- OCCUPATION_TYPE is missing 31% of its values. The majority of the nulls come from pensioners (which makes sense); other large chunks come from working people, commercial associates, state servants, etc. My take is that the limited choices in the occupation-type field led many working people to leave it blank.
- ORGANIZATION_TYPE is quite diverse in terms of where customers work. Based on the histogram, defaulting individuals are dominant among 'Business Entity Type 3', 'Self-employed', and 'XNA'.
Decisions:
- AMT_INCOME_TOTAL: create a new feature 'IncomeBand' converting the income into an ordinal band category.
- DAYS_EMPLOYED: convert the anomalous 365,243 days to -29,200 days (80 years); it's safe to assume an individual has retired after 80 years of work. Also create a new feature 'YEARS_EMPLOYED' converting the day count to years for easier reading.
- OCCUPATION_TYPE: drop. We can instead leverage how much total income customers have and how they acquire it.
- NAME_INCOME_TYPE: drop. We can usually infer whether someone is working, a student, or retired from their age and ORGANIZATION_TYPE.
- ORGANIZATION_TYPE: Convert categorical text to numerical. Use feature for modeling.
# COLUMNS TO BE DROPPED
list_col_drop.extend(['OCCUPATION_TYPE','NAME_INCOME_TYPE'])
# CREATE NEW COLUMN : INCOME_BAND
list_col_new_income = ['AMT_INCOME_TOTAL']
# CONVERT COLUMN : 365243 to -29200
list_col_conv_daysemp = ['DAYS_EMPLOYED']
# CREATE NEW COLUMN : YEARS_EMPLOYED
list_col_new_yrsemp = ['DAYS_EMPLOYED']
# CONVERT COLUMN :
list_col_conv_org = ['ORGANIZATION_TYPE']
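The DAYS_EMPLOYED and income decisions above could be sketched as follows. The sentinel replacement and the years conversion follow the decisions directly; the quartile-based `pd.qcut` for 'IncomeBand' is an illustrative choice, since the banding method isn't fixed yet:

```python
import pandas as pd

df = pd.DataFrame({
    'AMT_INCOME_TOTAL': [25650, 112500, 147150, 202500, 900000],
    'DAYS_EMPLOYED':    [-2760, 365243, -1213, -289, 365243],
})

# Replace the 365,243-day sentinel (pensioners) with -29,200 days (~80 years)
df.loc[df['DAYS_EMPLOYED'] == 365243, 'DAYS_EMPLOYED'] = -29200

# Express employment duration in positive years for readability
df['YEARS_EMPLOYED'] = (-df['DAYS_EMPLOYED'] / 365).round(1)

# Ordinal income bands; quartiles are one reasonable binning
df['IncomeBand'] = pd.qcut(df['AMT_INCOME_TOTAL'], q=4, labels=False)
print(df['YEARS_EMPLOYED'].tolist())  # [7.6, 80.0, 3.3, 0.8, 80.0]
```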
And for the final set of variables, I tagged them as 'trustworthiness' to account for trustworthy credentials, or environmental/unconscious factors that may shape an individual's behavior. There are quite a lot, so let's break them down again by function. (Note: the Excel sheet above is incomplete, as this notebook keeps crashing whenever I paste the remaining fields.)
appl_train[['CODE_GENDER','NAME_TYPE_SUITE','NAME_EDUCATION_TYPE','DAYS_REGISTRATION','DAYS_ID_PUBLISH']].describe(include='all')
| CODE_GENDER | NAME_TYPE_SUITE | NAME_EDUCATION_TYPE | DAYS_REGISTRATION | DAYS_ID_PUBLISH | |
|---|---|---|---|---|---|
| count | 307511 | 306219 | 307511 | 307511.000000 | 307511.000000 |
| unique | 3 | 7 | 5 | NaN | NaN |
| top | F | Unaccompanied | Secondary / secondary special | NaN | NaN |
| freq | 202448 | 248526 | 218391 | NaN | NaN |
| mean | NaN | NaN | NaN | -4986.120328 | -2994.202373 |
| std | NaN | NaN | NaN | 3522.886321 | 1509.450419 |
| min | NaN | NaN | NaN | -24672.000000 | -7197.000000 |
| 25% | NaN | NaN | NaN | -7479.500000 | -4299.000000 |
| 50% | NaN | NaN | NaN | -4504.000000 | -3254.000000 |
| 75% | NaN | NaN | NaN | -2010.000000 | -1720.000000 |
| max | NaN | NaN | NaN | 0.000000 | 0.000000 |
g = sns.FacetGrid(appl_train,row='CODE_GENDER',col='NAME_EDUCATION_TYPE',hue='TARGET',height=4) # nominal
g.map(plt.scatter,'DAYS_ID_PUBLISH','DAYS_REGISTRATION',alpha=0.5,edgecolor='k',linewidth=0.5) # continuous
fig = g.fig
fig.set_size_inches(25,10)
fig.subplots_adjust(top=0.85,wspace=0.3)
fig.suptitle('Gender - Educational Attainment - Registration Change - ID Change - Credit Ranking',fontsize=20)
l = g.add_legend(title='Credit Score')
paid = appl_train[appl_train.TARGET==0]
unpaid = appl_train[appl_train.TARGET==1]
f,axes = plt.subplots(1,2,figsize=(10,3))
# fill=True replaces the deprecated shade argument
sns.kdeplot(x=paid['DAYS_ID_PUBLISH'], y=paid['DAYS_REGISTRATION'], cmap="Blues", fill=True, ax=axes[0]).set_title('Paid')
sns.kdeplot(x=unpaid['DAYS_ID_PUBLISH'], y=unpaid['DAYS_REGISTRATION'], cmap="Reds", fill=True, ax=axes[1]).set_title('Unpaid')
sns.set_style('whitegrid')
plt.tight_layout()
appl_train.NAME_EDUCATION_TYPE.value_counts()
| count | |
|---|---|
| NAME_EDUCATION_TYPE | |
| Secondary / secondary special | 218391 |
| Higher education | 74863 |
| Incomplete higher | 10277 |
| Lower secondary | 3816 |
| Academic degree | 164 |
f,axes = plt.subplots(1,2,figsize=(10,5),sharex=True)
# kdeplot (with fill=True) replaces the deprecated distplot
sns.kdeplot(appl_train['DAYS_BIRTH'][appl_train.TARGET==0], color="b", fill=True, ax=axes[0]).set_title('Target == 0 (Paid)');
sns.kdeplot(appl_train['DAYS_BIRTH'][appl_train.TARGET==1], color="r", fill=True, ax=axes[1]).set_title('Target == 1 (Unpaid)');
Observations:
- We have 3 CODE_GENDER values in our sample: male, female, and XNA. Male and female customers are both numerous, with similar default rates relative to their total counts. The XNA group (non-binary?) is very small, and in our sample 100% of them have a good credit record.
- For NAME_EDUCATION_TYPE, the majority of our customers, whether male, female, or XNA, have 'Secondary / secondary special' or 'Higher education', and most of the defaulting customers come from these backgrounds. 'Incomplete higher' and 'Lower secondary' are smaller groups with fewer defaults, and 'Academic degree' is the smallest group with the fewest unpaid customers.
- The number of days since the customer changed their identity document and/or application registration shows the same pattern for those with good and bad credit.
- DAYS_BIRTH (effectively the customer's age) behaves as expected. Note that the x-axis values are negative (recorded relative to the application date) and large (days rather than years). Per the graphs, younger people tend to default more; as people get older, they become more likely to pay off the loan.
Decisions:
- Use CODE_GENDER as a model feature. Convert the categorical text to numeric.
- Use NAME_EDUCATION_TYPE as a model feature. Convert the categorical text to numeric.
- Drop DAYS_REGISTRATION, DAYS_ID_PUBLISH, and NAME_TYPE_SUITE. I generated a KDE plot and can hardly see any difference between the classes.
- Create new feature 'age' from DAYS_BIRTH. Compute for the age in years rather than using the day count.
# COLUMNS TO BE DROPPED
list_col_drop.extend(['NAME_TYPE_SUITE','DAYS_REGISTRATION','DAYS_ID_PUBLISH'])
# CONVERT COLUMN :
list_col_conv_gender = ['CODE_GENDER']
# CONVERT COLUMN :
list_col_conv_edu = ['NAME_EDUCATION_TYPE']
# CREATE NEW COLUMN : AGE
list_col_new_age = ['DAYS_BIRTH']
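The age derivation and one possible categorical-to-numeric conversion could look like this sketch. The ordinal education mapping below is an assumption (ordered by attainment level), not the notebook's final encoding:

```python
import pandas as pd

df = pd.DataFrame({
    'DAYS_BIRTH': [-9125, -16425, -23725],  # negative days before application
    'NAME_EDUCATION_TYPE': ['Lower secondary',
                            'Secondary / secondary special',
                            'Higher education'],
})

# AGE in whole years from the negative day count
df['AGE'] = (-df['DAYS_BIRTH'] // 365).astype(int)

# Hypothetical ordinal mapping, lowest to highest attainment
edu_map = {
    'Lower secondary': 0,
    'Secondary / secondary special': 1,
    'Incomplete higher': 2,
    'Higher education': 3,
    'Academic degree': 4,
}
df['NAME_EDUCATION_TYPE'] = df['NAME_EDUCATION_TYPE'].map(edu_map)
print(df['AGE'].tolist())  # [25, 45, 65]
```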
temp1=['FLAG_MOBIL',
'FLAG_EMP_PHONE',
'FLAG_WORK_PHONE',
'FLAG_CONT_MOBILE',
'FLAG_PHONE',
'FLAG_EMAIL']
temp2=['FLAG_DOCUMENT_2',
'FLAG_DOCUMENT_3',
'FLAG_DOCUMENT_4',
'FLAG_DOCUMENT_5',
'FLAG_DOCUMENT_6',
'FLAG_DOCUMENT_7',
'FLAG_DOCUMENT_8',
'FLAG_DOCUMENT_9',
'FLAG_DOCUMENT_10',
'FLAG_DOCUMENT_11',
'FLAG_DOCUMENT_12',
'FLAG_DOCUMENT_13',
'FLAG_DOCUMENT_14',
'FLAG_DOCUMENT_15',
'FLAG_DOCUMENT_16',
'FLAG_DOCUMENT_17',
'FLAG_DOCUMENT_18',
'FLAG_DOCUMENT_19',
'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_21']
appl_train[temp1+temp2].describe()
| FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | ... | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | ... | 307511.000000 | 307511.000000 | 307511.000000 | 307511.00000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 |
| mean | 0.999997 | 0.819889 | 0.199368 | 0.998133 | 0.281066 | 0.056720 | 0.000042 | 0.710023 | 0.000081 | 0.015115 | ... | 0.000007 | 0.003525 | 0.002936 | 0.00121 | 0.009928 | 0.000267 | 0.008130 | 0.000595 | 0.000507 | 0.000335 |
| std | 0.001803 | 0.384280 | 0.399526 | 0.043164 | 0.449521 | 0.231307 | 0.006502 | 0.453752 | 0.009016 | 0.122010 | ... | 0.002550 | 0.059268 | 0.054110 | 0.03476 | 0.099144 | 0.016327 | 0.089798 | 0.024387 | 0.022518 | 0.018299 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 26 columns
# Sums up the given flag columns per row, then plots a countplot of the new sum column.
def featsum(cols, newcol):
    sample = appl_train.copy()
    sample[newcol] = appl_train[cols].sum(axis=1)
    sns.catplot(data=sample, x=newcol, hue='TARGET', kind='count');
featsum(temp1,'FlagContact')
featsum(temp2,'FlagDocu')
featsum(temp1,'FlagContact')
featsum(temp2,'FlagDocu')
Observations:
- All 26 'FLAG-' (flag for contacts and documents) variables have nominal categorical values: 1='YES', 0='NO'.
Decisions:
- Create new feature 'FlagContact' to account for the total flag-contact variable recorded per individual.
- Create new feature 'FlagDocu' to account for the total flag-document variable recorded per individual.
# CREATE NEW COLUMN : FLAG_CONTACTS
list_col_new_flagCont = [
'FLAG_MOBIL',
'FLAG_EMP_PHONE',
'FLAG_WORK_PHONE',
'FLAG_CONT_MOBILE',
'FLAG_PHONE',
'FLAG_EMAIL']
# CREATE NEW COLUMN : FLAG_DOCS
list_col_new_flagDoc = [
'FLAG_DOCUMENT_2',
'FLAG_DOCUMENT_3',
'FLAG_DOCUMENT_4',
'FLAG_DOCUMENT_5',
'FLAG_DOCUMENT_6',
'FLAG_DOCUMENT_7',
'FLAG_DOCUMENT_8',
'FLAG_DOCUMENT_9',
'FLAG_DOCUMENT_10',
'FLAG_DOCUMENT_11',
'FLAG_DOCUMENT_12',
'FLAG_DOCUMENT_13',
'FLAG_DOCUMENT_14',
'FLAG_DOCUMENT_15',
'FLAG_DOCUMENT_16',
'FLAG_DOCUMENT_17',
'FLAG_DOCUMENT_18',
'FLAG_DOCUMENT_19',
'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_21']
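`featsum` above only plots the sums; persisting them as actual feature columns could look like the sketch below (toy frame with a subset of the flag columns, for illustration only):

```python
import pandas as pd

df = pd.DataFrame({
    'FLAG_MOBIL':      [1, 1, 1],
    'FLAG_EMP_PHONE':  [1, 0, 1],
    'FLAG_EMAIL':      [0, 0, 1],
    'FLAG_DOCUMENT_3': [1, 1, 0],
    'FLAG_DOCUMENT_6': [0, 1, 0],
})

contact_cols = ['FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_EMAIL']  # subset of the 6
doc_cols = ['FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_6']              # subset of the 20

# Row-wise sums: how many contact details / documents each applicant provided
df['FlagContact'] = df[contact_cols].sum(axis=1)
df['FlagDocu'] = df[doc_cols].sum(axis=1)
print(df['FlagContact'].tolist())  # [2, 1, 3]
```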
# Function for creating categorical count plots
def plotcat(x, r):
    sns.catplot(data=appl_train, x=x, hue='TARGET', kind='count');
    plt.xticks(rotation=r);
plotcat('REGION_RATING_CLIENT',0)
plotcat('REGION_RATING_CLIENT_W_CITY',0)
print('Correlation:',round(appl_train['REGION_RATING_CLIENT_W_CITY'].corr(appl_train['REGION_RATING_CLIENT']),2))
Correlation: 0.95
temp=[
'REG_REGION_NOT_LIVE_REGION',
'LIVE_REGION_NOT_WORK_REGION',
'REG_CITY_NOT_LIVE_CITY',
'LIVE_CITY_NOT_WORK_CITY'
]
featsum(temp,'FlagAddr')
Observations:
- REGION_RATING_CLIENT and REGION_RATING_CLIENT_W_CITY are highly correlated, scoring 0.95. These are Home Credit's ratings of the region where the client lives (values 1, 2, 3). The majority of the sample has a rating of '2', where the defaulting count is most visible.
- The 4 flag fields indicate whether a customer's registered address differs from another given address (0 = same, 1 = different). Most customers declared their addresses consistently (scoring 0), and since '0' has a very high frequency, that group naturally contains more of the customers who may default on their loans.
Decisions:
- Use REGION_RATING_CLIENT as a model feature.
- Drop REGION_RATING_CLIENT_W_CITY since it is highly correlated to region rating alone.
- Create a new ordinal feature 'FLAG_ADDR' as the sum of all 4 address-flag fields. The higher the score, the more inconsistent the individual's address declarations.
# COLUMNS TO BE DROPPED
list_col_drop.extend(['REGION_RATING_CLIENT_W_CITY'])
# CREATE NEW COLUMN : FLAG_ADDR
list_col_new_flagAddr = ['REG_REGION_NOT_LIVE_REGION','LIVE_REGION_NOT_WORK_REGION','REG_CITY_NOT_LIVE_CITY','LIVE_CITY_NOT_WORK_CITY']
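The 'FLAG_ADDR' decision above is a straightforward row-wise sum; a minimal sketch (toy frame for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'REG_REGION_NOT_LIVE_REGION':  [0, 1, 0],
    'LIVE_REGION_NOT_WORK_REGION': [0, 1, 1],
    'REG_CITY_NOT_LIVE_CITY':      [0, 0, 1],
    'LIVE_CITY_NOT_WORK_CITY':     [0, 1, 1],
})

addr_cols = ['REG_REGION_NOT_LIVE_REGION', 'LIVE_REGION_NOT_WORK_REGION',
             'REG_CITY_NOT_LIVE_CITY', 'LIVE_CITY_NOT_WORK_CITY']
# 0 = addresses match, 1 = mismatch; higher sum = more inconsistent declarations
df['FLAG_ADDR'] = df[addr_cols].sum(axis=1)
print(df['FLAG_ADDR'].tolist())  # [0, 3, 3]
```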
appl_train[['OBS_30_CNT_SOCIAL_CIRCLE','DEF_30_CNT_SOCIAL_CIRCLE','OBS_60_CNT_SOCIAL_CIRCLE','DEF_60_CNT_SOCIAL_CIRCLE','DAYS_LAST_PHONE_CHANGE']].describe()
| OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | |
|---|---|---|---|---|---|
| count | 306490.000000 | 307511.000000 | 306490.000000 | 306490.000000 | 307510.000000 |
| mean | 1.422245 | 0.142944 | 1.405292 | 0.100049 | -962.858788 |
| std | 2.400989 | 0.446033 | 2.379803 | 0.362291 | 826.808487 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -4292.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1570.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -757.000000 |
| 75% | 2.000000 | 0.000000 | 2.000000 | 0.000000 | -274.000000 |
| max | 348.000000 | 34.000000 | 344.000000 | 24.000000 | 0.000000 |
# Sums up items per row across two column groups, then plots a split violinplot of the two sums.
def featsumviolin(df, cols1, cols2, newcol1, newcol2):
    sample = df.copy()  # copy the (possibly filtered) input, not the full appl_train, so indices align
    sample[newcol1] = df[cols1].sum(axis=1)
    sample[newcol2] = df[cols2].sum(axis=1)
    fig, ax = plt.subplots(figsize=(10, 5))
    sns.violinplot(data=sample, hue='TARGET', x=newcol1, y=newcol2, split=True,
                   inner='quart', linewidth=1.3, palette={1: "#FF9999", 0: "white"});
featsumviolin(appl_train[(appl_train.OBS_30_CNT_SOCIAL_CIRCLE<348)&(appl_train.OBS_60_CNT_SOCIAL_CIRCLE<344)],
['DEF_30_CNT_SOCIAL_CIRCLE','DEF_60_CNT_SOCIAL_CIRCLE'],
['OBS_30_CNT_SOCIAL_CIRCLE','OBS_60_CNT_SOCIAL_CIRCLE'],
'DEF_NEW','OBS_NEW')
sns.set(palette="muted",color_codes=True)
f,axes = plt.subplots(2,2,figsize=(10,5),sharex=True)
# histplot / kdeplot replace the deprecated distplot
sns.histplot(appl_train['DAYS_LAST_PHONE_CHANGE'][appl_train.TARGET==0], color="b", ax=axes[0,0]).set_title('Target == 0 (Paid)')
sns.kdeplot(appl_train['DAYS_LAST_PHONE_CHANGE'][appl_train.TARGET==0], color="g", fill=True, ax=axes[0,1]).set_title('Target == 0 (Paid)')
sns.histplot(appl_train['DAYS_LAST_PHONE_CHANGE'][appl_train.TARGET==1], color="r", ax=axes[1,0]).set_title('Target == 1 (Unpaid)')
sns.kdeplot(appl_train['DAYS_LAST_PHONE_CHANGE'][appl_train.TARGET==1], color="m", fill=True, ax=axes[1,1]).set_title('Target == 1 (Unpaid)')
plt.tight_layout()
Observations:
- There are 2 variables for 30 days past due (30 DPD) and another 2 for 60 days past due (60 DPD). I combined them and plotted the result in the violinplot above. From what I can tell, people who defaulted and people who didn't have roughly equal chances of missing a payment on their due date. Surprisingly though, the majority of those who paid at a much later date still got their loans cleared.
Note: A 30-days-late payment status means the payment is 30-59 days past the due date (30 DPD); a 60-days-late status means it is 60-89 days past the due date (60 DPD).
- I plotted the days since the customer last changed their phone (relative to the application) separately for good and bad loan records. The result is largely as expected: most people change phones within only 2-3 years, so the distribution is skewed, and the counts shrink as phone usage extends beyond that average.
Decisions:
- Drop the DPD variables.
- Drop DAYS_LAST_PHONE_CHANGE.
# COLUMNS TO BE DROPPED
list_col_drop.extend(['OBS_30_CNT_SOCIAL_CIRCLE','DEF_30_CNT_SOCIAL_CIRCLE','OBS_60_CNT_SOCIAL_CIRCLE','DEF_60_CNT_SOCIAL_CIRCLE','DAYS_LAST_PHONE_CHANGE'])
appl_train[['AMT_REQ_CREDIT_BUREAU_HOUR','AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK','AMT_REQ_CREDIT_BUREAU_MON','AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR']].describe()
| | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|
| count | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| mean | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
for i in ['AMT_REQ_CREDIT_BUREAU_HOUR','AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK']:
for j in ['AMT_REQ_CREDIT_BUREAU_MON','AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR']:
fig,ax = plt.subplots(figsize=(10,3))
sns.stripplot(data=appl_train[appl_train.AMT_REQ_CREDIT_BUREAU_QRT<261],x=i,y=j,hue='TARGET',alpha=0.5,jitter=0.3,dodge=True,ax=ax)
plt.show()
Observations:
- The majority of the sample has no credit bureau enquiries (0.0) in any time window, regardless of repayment status. The most common enquiry counts range from 0 to 2, and the behavior is similar for both TARGET==0 and TARGET==1. Note that only a small fraction of the sample (about 8%) defaulted (TARGET==1).
Decisions:
- Drop the credit bureau enquiry fields.
# COLUMNS TO BE DROPPED
list_col_drop.extend(['AMT_REQ_CREDIT_BUREAU_HOUR','AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK','AMT_REQ_CREDIT_BUREAU_MON','AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR'])
6. Data Wrangling¶
After checking all our fields, it is time to proceed with data wrangling: cleaning and transforming the data we have decided to keep.
Complete the null values of the following features:
- 'EXT_SOURCE_1'
- 'EXT_SOURCE_2'
- 'EXT_SOURCE_3'
# Essential Functions
def getmean(df, ls_cols):
    # Mean of each column in ls_cols
    return [df[i].mean() for i in ls_cols]
def fill_ave_ext(df, ls_cols):
    # Fill nulls in each EXT_SOURCE_* column with its column mean
    for col, mean_val in zip(ls_cols, getmean(df, ls_cols)):
        df[col] = df[col].fillna(mean_val)
    return df
# Fill in the training set
fill_ave_ext(appl_train,list_col_fill_ext);
# Fill in the testing set
fill_ave_ext(appl_test,list_col_fill_ext);
# NO MORE NULL VALUES FOR 'EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3'
print(appl_train[list_col_fill_ext].info())
print(appl_test[list_col_fill_ext].info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 3 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   EXT_SOURCE_1  307511 non-null  float64
 1   EXT_SOURCE_2  307511 non-null  float64
 2   EXT_SOURCE_3  307511 non-null  float64
dtypes: float64(3)
memory usage: 7.0 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   EXT_SOURCE_1  48744 non-null  float64
 1   EXT_SOURCE_2  48744 non-null  float64
 2   EXT_SOURCE_3  48744 non-null  float64
dtypes: float64(3)
memory usage: 1.1 MB
None
Complete the null values for 'CNT_FAM_MEMBERS'.
#Complete the null values for 'CNT_FAM_MEMBERS'.
def fill_0_fam(df,ls_cols):
df[ls_cols] = df[ls_cols].fillna(0)
return df
fill_0_fam(appl_train,'CNT_FAM_MEMBERS');
fill_0_fam(appl_test,'CNT_FAM_MEMBERS');
# NO MORE NULL VALUES FOR 'CNT_FAM_MEMBERS'
print(appl_train[list_col_fill_fam].info())
print(appl_test[list_col_fill_fam].info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 1 columns):
 #   Column           Non-Null Count   Dtype
---  ------           --------------   -----
 0   CNT_FAM_MEMBERS  307511 non-null  float64
dtypes: float64(1)
memory usage: 2.3 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Data columns (total 1 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   CNT_FAM_MEMBERS  48744 non-null  float64
dtypes: float64(1)
memory usage: 380.9 KB
None
Convert the anomaly data in 'DAYS_EMPLOYED'.
# ANOMALY DATA COUNT BEFORE CONVERSION
print('Train set :',len(appl_train[appl_train.DAYS_EMPLOYED==365243]))
print('Test set :',len(appl_test[appl_test.DAYS_EMPLOYED==365243]))
Train set : 55374
Test set : 9274
def conv_daysemp(df,ls_cols):
    # Assign back instead of chained inplace replace, which is deprecated
    # and will stop working in pandas 3.0
    df[ls_cols[0]] = df[ls_cols[0]].replace(to_replace=365243,value=-29200)
    return df
conv_daysemp(appl_train,list_col_conv_daysemp);
conv_daysemp(appl_test,list_col_conv_daysemp);
# ANOMALY DATA COUNT AFTER CONVERSION
print('Train set :',len(appl_train[appl_train.DAYS_EMPLOYED==365243]))
print('Test set :',len(appl_test[appl_test.DAYS_EMPLOYED==365243]))
Train set : 0
Test set : 0
Convert the categorical text columns to numerical ones for:
- CODE_GENDER
- NAME_EDUCATION_TYPE
- ORGANIZATION_TYPE
# BEFORE CONVERSION
print(appl_train[['CODE_GENDER','NAME_EDUCATION_TYPE','ORGANIZATION_TYPE']].head(3))
print(appl_test[['CODE_GENDER','NAME_EDUCATION_TYPE','ORGANIZATION_TYPE']].head(3))
  CODE_GENDER            NAME_EDUCATION_TYPE       ORGANIZATION_TYPE
0           M  Secondary / secondary special  Business Entity Type 3
1           F               Higher education                  School
2           M  Secondary / secondary special              Government
  CODE_GENDER            NAME_EDUCATION_TYPE       ORGANIZATION_TYPE
0           F               Higher education            Kindergarten
1           M  Secondary / secondary special           Self-employed
2           M               Higher education       Transport: type 3
def conv_gender(df,ls_cols):
df[ls_cols[0]] = df[ls_cols[0]].map({'XNA':0,'M':1,'F':2}).astype(int)
return df
def conv_education(df,ls_cols):
temp_dict = {
'Lower secondary':1,
'Secondary / secondary special':2,
'Incomplete higher':3,
'Higher education':4,
'Academic degree':5
}
df[ls_cols[0]] = df[ls_cols[0]].map(temp_dict).astype(int)
return df
def conv_org(df,ls_cols):
ls_ctr=[i for i in np.arange(1,len(ls_cols)+1)]
temp_dict = dict(zip(ls_cols,ls_ctr))
df['ORGANIZATION_TYPE'] = df['ORGANIZATION_TYPE'].map(temp_dict).astype(int)
return df
conv_gender(appl_train,list_col_conv_gender);
conv_gender(appl_test,list_col_conv_gender);
conv_education(appl_train,list_col_conv_edu);
conv_education(appl_test,list_col_conv_edu);
orgtype = sorted(appl_train.ORGANIZATION_TYPE.unique())
conv_org(appl_train,orgtype);
conv_org(appl_test,orgtype);
# AFTER CONVERSION
print(appl_train[['CODE_GENDER','NAME_EDUCATION_TYPE','ORGANIZATION_TYPE']].head(3))
print(appl_test[['CODE_GENDER','NAME_EDUCATION_TYPE','ORGANIZATION_TYPE']].head(3))
   CODE_GENDER  NAME_EDUCATION_TYPE  ORGANIZATION_TYPE
0            1                    2                  6
1            2                    4                 40
2            1                    2                 12
   CODE_GENDER  NAME_EDUCATION_TYPE  ORGANIZATION_TYPE
0            2                    4                 29
1            1                    2                 43
2            1                    4                 55
7. Hypothesis Testing¶
7.1 Debt-to-Income Ratio and Income per Child¶
Hypothesis: The Debt-to-Income Ratio is correlated with Income per Child.¶
Null Hypothesis (H0):
There is no significant difference in the Debt-to-Income Ratio based on the Income per Child.
Alternative Hypothesis (H1):
The Debt-to-Income Ratio significantly varies based on the Income per Child.
Test:
Pearson Correlation Coefficient.
Reason for choosing this Test:
The Pearson correlation coefficient is appropriate for assessing the strength and direction of the linear relationship between two continuous variables, in this case, the Debt-to-Income Ratio and Income per Child.
# Create new features
appl_train['debt_income_ratio'] = appl_train['AMT_CREDIT'] / appl_train['AMT_INCOME_TOTAL']
appl_train['income_per_child'] = appl_train['AMT_INCOME_TOTAL'] / (1 + appl_train['CNT_CHILDREN'])
# Step 1: Define significance level
alpha = 0.05
# Step 2: Calculate the Pearson correlation and p-value
correlation, p_value = stats.pearsonr(appl_train['debt_income_ratio'], appl_train['income_per_child'])
# Step 3: Print the results
print(f"Pearson correlation coefficient: {correlation}")
print(f"p-value: {p_value}")
# Step 4: Hypothesis Testing
if p_value <= alpha:
print("Reject the null hypothesis: Significant correlation exists.")
else:
print("Fail to reject the null hypothesis: No significant correlation exists.")
Pearson correlation coefficient: -0.13704002646062904
p-value: 0.0
Reject the null hypothesis: Significant correlation exists.
Debt-to-Income Ratio by Default Status¶
Visualize the distribution of the debt-to-income ratio across default and non-default users.
plt.figure(figsize=(8,4))
sns.histplot(x='debt_income_ratio', hue='TARGET', data=appl_train, bins=50, kde=True, palette=['blue', 'white'])
plt.xlim(0, 20)
plt.title('Debt to Income Ratio by Default Status')
plt.xlabel('Debt to Income Ratio')
plt.figure(figsize=(8, 6))
sns.scatterplot(x='debt_income_ratio', y='income_per_child', data=appl_train)
plt.title('Scatter Plot of Debt-to-Income Ratio vs. Income per Child')
plt.xlabel('Debt-to-Income Ratio')
plt.ylabel('Income per Child')
plt.axhline(0, color='red', linestyle='--') # Optional: Add a horizontal line for reference
plt.axvline(0, color='red', linestyle='--') # Optional: Add a vertical line for reference
Hypothesis Conclusion:¶
Based on the Pearson correlation test, the coefficient between the debt-to-income ratio and income per child is approximately -0.137, indicating a weak negative correlation. The reported p-value of 0.0 (i.e., below floating-point precision) is far smaller than our chosen significance level of 0.05, so we reject the null hypothesis and conclude that the two variables are significantly, though weakly, correlated. Note that with roughly 300,000 observations even a weak correlation reaches statistical significance, so the practical effect is modest.
7.2 Income Level and Default Likelihood¶
Hypothesis: Users with lower income levels are more likely to default.¶
Null Hypothesis (H0):
There is no significant difference in income levels between users who default and those who do not.
Alternative Hypothesis (H1):
There is a significant difference in income levels between users who default and those who do not.
Test:
Independent Samples T-Test.
Reason for choosing this Test:
The Independent Samples T-Test is appropriate for comparing the means of two independent groups (users who default and those who do not) to determine whether income levels differ significantly between the two groups.
# Split income data based on target variable
defaulted_income = appl_train[appl_train['TARGET'] == 1]['AMT_INCOME_TOTAL']
non_defaulted_income = appl_train[appl_train['TARGET'] == 0]['AMT_INCOME_TOTAL']
# Perform T-test
t_stat, p_val = ttest_ind(defaulted_income, non_defaulted_income, equal_var=False)
print("T-Test for Income Levels and Default Likelihood")
print(f"T-Statistic: {t_stat:.4f}, P-Value: {p_val:.4f}")
T-Test for Income Levels and Default Likelihood
T-Statistic: -0.7307, P-Value: 0.4650
plt.figure(figsize=(12, 6))
# Box Plot
plt.subplot(1, 2, 1)
sns.boxplot(x='TARGET', y='AMT_INCOME_TOTAL', hue='TARGET', data=appl_train, palette='pastel', legend=False)
plt.title('Box Plot of Income Levels by Default Status')
plt.xlabel('Default Status (0 = No Default, 1 = Default)')
plt.ylabel('Income Level (Total Income)')
# Violin Plot
plt.subplot(1, 2, 2)
sns.violinplot(x='TARGET', y='AMT_INCOME_TOTAL', hue='TARGET', data=appl_train, palette='pastel', legend=False)
plt.title('Violin Plot of Income Levels by Default Status')
plt.xlabel('Default Status (0 = No Default, 1 = Default)')
plt.ylabel('Income Level (Total Income)')
plt.tight_layout()
Hypothesis Conclusion:¶
The Independent Samples T-Test revealed a T-Statistic of -0.7307 and a P-Value of 0.4650. Since the P-Value is significantly higher than the common significance level of 0.05, we fail to reject the null hypothesis (H0).
This indicates that there is no significant difference in income levels between users who default and those who do not. Therefore, we conclude that income level does not appear to be a significant factor influencing loan default likelihood in this analysis.
7.3 Education Level and Income¶
Hypothesis: Users with higher education levels have higher average income.¶
Null Hypothesis (H0):
There is no significant difference in income based on education level.
Alternative Hypothesis (H1):
Income significantly varies by education level.
Test:
ANOVA (Analysis of Variance).
Reason for choosing this Test:
ANOVA (Analysis of Variance) is chosen to assess whether there are significant differences in average income across multiple education levels, as it effectively compares means among three or more independent groups.
education_groups = [group['AMT_INCOME_TOTAL'].values for name, group in appl_train.groupby('NAME_EDUCATION_TYPE')]
# Perform ANOVA
f_stat, p_val = f_oneway(*education_groups)
# Print ANOVA results
print("ANOVA for Income by Education Level")
print(f"F-Statistic: {f_stat:.4f}, P-Value: {p_val:.4f}")
# Visualization: Box Plot for Income by Education Level
plt.figure(figsize=(12, 6))
sns.boxplot(x='NAME_EDUCATION_TYPE', y='AMT_INCOME_TOTAL', hue='NAME_EDUCATION_TYPE', data=appl_train, palette='pastel', legend=False)
plt.title('Box Plot of Income by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Income Level (Total Income)')
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.tight_layout()
ANOVA for Income by Education Level
F-Statistic: 753.2160, P-Value: 0.0000
Hypothesis Conclusion:¶
The ANOVA analysis for income by education level yielded an F-Statistic of 753.2160 and a P-Value of 0.0000. Since the P-Value is significantly lower than the common significance level of 0.05, we reject the null hypothesis (H0).
This indicates that there is a significant difference in income based on education level. Consequently, we conclude that users with varying education levels have different average income levels, supporting the hypothesis that higher education levels are associated with higher average incomes.
7.4 Employment Length and Default Likelihood¶
Hypothesis: Users with shorter employment durations are more likely to default.¶
Null Hypothesis (H0):
There is no significant difference in employment duration between users who default and those who do not.
Alternative Hypothesis (H1):
There is a significant difference in employment duration between users who default and those who do not.
Test:
Independent Samples T-Test.
Reason for choosing this Test:
The Independent Samples T-Test is suitable for comparing the means of two independent groups (users who default and those who do not) to determine whether there is a significant difference in employment length between them.
defaulted_emp = appl_train[appl_train['TARGET'] == 1]['DAYS_EMPLOYED'].abs()
non_defaulted_emp = appl_train[appl_train['TARGET'] == 0]['DAYS_EMPLOYED'].abs()
# Perform T-test
t_stat, p_val_emp = ttest_ind(defaulted_emp, non_defaulted_emp, equal_var=False)
print("T-Test for Employment Length and Default Likelihood")
print(f"T-Statistic: {t_stat:.4f}, P-Value: {p_val_emp:.4f}")
# Visualization: Box Plot for Employment Length by Default Status
plt.figure(figsize=(12, 6))
sns.boxplot(x='TARGET', y='DAYS_EMPLOYED', hue='TARGET', data=appl_train, palette='pastel', legend=False)
plt.title('Box Plot of Employment Length by Default Status')
plt.xlabel('Default Status (0 = No Default, 1 = Default)')
plt.ylabel('Employment Length (Days)')
plt.xticks(ticks=[0, 1], labels=['Non-Defaulted', 'Defaulted'])
plt.tight_layout()
plt.show()
# Interpretation
if p_val_emp < 0.05:
print("Result: Reject the null hypothesis (significant difference in employment length between groups).")
else:
print("Result: Fail to reject the null hypothesis (no significant difference in employment length).")
T-Test for Employment Length and Default Likelihood
T-Statistic: -37.4471, P-Value: 0.0000
Result: Reject the null hypothesis (significant difference in employment length between groups).
Hypothesis Conclusion:¶
The T-Test for Employment Length and Default Likelihood yielded a T-Statistic of -37.4471 and a P-Value of 0.0000. Since the P-Value is well below the significance level of 0.05, we reject the null hypothesis (H0).
This indicates a significant difference in employment length between users who default and those who do not, supporting the hypothesis that users with shorter employment durations are more likely to default. (One caveat: the anomalous DAYS_EMPLOYED placeholder was recoded to -29200 during data wrangling, which influences the group means.)
7.5 Number of Children and Default Likelihood¶
Hypothesis: Users with more children are more likely to default.¶
Null Hypothesis (H0):
There is no association between the number of children and default likelihood.
Alternative Hypothesis (H1):
There is an association between the number of children and default likelihood.
Test:
Chi-Square Test
Reason for choosing this Test:
The Chi-Square Test for Independence is appropriate for assessing the relationship between two categorical variables—in this case, the number of children and default status—to determine if the distribution of defaults varies based on the number of children users have.
children_contingency = pd.crosstab(appl_train['CNT_CHILDREN'], appl_train['TARGET'])
# Perform Chi-Square test
chi2, p_val_children, dof, expected = chi2_contingency(children_contingency)
print("Chi-Square Test for Number of Children and Default Likelihood")
print(f"Chi2 Statistic: {chi2:.4f}, P-Value: {p_val_children:.4f}")
# Visualization: Bar Plot for Number of Children by Default Status
# pandas .plot creates its own figure, so a separate plt.figure would be left empty
children_contingency.plot(kind='bar', stacked=True, color=['skyblue', 'salmon'], figsize=(12, 6))
plt.xlim(0,6)
plt.title('Default Status by Number of Children')
plt.xlabel('Number of Children')
plt.ylabel('Count')
plt.xticks(rotation=0) # Keep x-axis labels horizontal
plt.legend(title='Default Status', labels=['Non-Defaulted (0)', 'Defaulted (1)'])
plt.tight_layout()
# Interpretation
if p_val_children < 0.05:
print("Result: Reject the null hypothesis (significant association between number of children and default likelihood).")
else:
print("Result: Fail to reject the null hypothesis (no significant association).")
Chi-Square Test for Number of Children and Default Likelihood
Chi2 Statistic: 185.4518, P-Value: 0.0000
Result: Reject the null hypothesis (significant association between number of children and default likelihood).
Hypothesis Conclusion:¶
The Chi-Square Test for Number of Children and Default Likelihood yielded a Chi2 Statistic of 185.4518 and a P-Value of 0.0000. Since the P-Value is significantly lower than the common significance level of 0.05, we reject the null hypothesis (H0).
This indicates a significant association between the number of children and the likelihood of defaulting. Strictly speaking, the chi-square test establishes association rather than direction; the stacked bar plot above is consistent with the hypothesis that users with more children are more likely to default, highlighting the potential impact of family size on financial stability and default risk.
7.6 Annuity Amount and Loan Default¶
Hypothesis: Higher annuity amounts are associated with a higher likelihood of default.¶
Null Hypothesis (H0):
There is no significant difference in annuity amount between users who default and those who don’t.
Alternative Hypothesis (H1):
There is a significant difference in annuity amount between users who default and those who don’t.
Test:
Independent Samples T-Test on AMT_ANNUITY.
Reason for choosing this Test:
The Independent Samples T-Test is suitable for comparing the means of two independent groups (users who default and those who do not) to determine whether there is a significant difference in annuity amount between them.
# Separate annuity data for defaulted and non-defaulted users
defaulted_annuity = appl_train[appl_train['TARGET'] == 1]['AMT_ANNUITY']
non_defaulted_annuity = appl_train[appl_train['TARGET'] == 0]['AMT_ANNUITY']
# Perform T-test
t_stat, p_val_annuity = ttest_ind(defaulted_annuity.dropna(), non_defaulted_annuity.dropna(), equal_var=False)
print("T-Test for Annuity Amount and Default Likelihood")
print(f"T-Statistic: {t_stat:.4f}, P-Value: {p_val_annuity:.4f}")
plt.figure(figsize=(8, 5))
sns.histplot(data=appl_train, x='AMT_ANNUITY', hue='TARGET', bins=50, kde=True, palette='coolwarm')
plt.title("Annuity Amount Distribution by Default Status")
plt.xlim(0,100000)
plt.xlabel("Annuity Amount")
plt.ylabel("Frequency")
plt.legend(title='Default Status', labels=['Non-Defaulted (0)', 'Defaulted (1)'])
plt.tight_layout()
# Interpretation
if p_val_annuity < 0.05:
print("Result: Reject the null hypothesis (significant difference in annuity amount between groups).")
else:
print("Result: Fail to reject the null hypothesis (no significant difference in annuity amount).")
T-Test for Annuity Amount and Default Likelihood
T-Statistic: -8.1473, P-Value: 0.0000
Result: Reject the null hypothesis (significant difference in annuity amount between groups).
Hypothesis Conclusion:¶
The T-Test for Annuity Amount and Default Likelihood resulted in a T-Statistic of -8.1473 and a P-Value of 0.0000. Since the P-Value is significantly less than the standard significance level of 0.05, we reject the null hypothesis (H0).
This indicates a significant difference in annuity amounts between users who default and those who do not. Therefore, we conclude that higher annuity amounts are indeed associated with a higher likelihood of default, suggesting that annuity amounts are an important factor in evaluating the risk of default among users.
Feature Creation and Feature Engineering¶
Feature engineering is a critical step in machine learning and data analysis where raw data is transformed into meaningful features. This process improves model performance by capturing the underlying patterns in the data, providing the model with structured, relevant inputs for accurate predictions.
How Feature Engineering Helped¶
Feature engineering transformed a complex, fragmented dataset into a structured, unified format that directly addressed our analytical goals:
- Captured Critical Metrics: Aggregated data revealed total amounts financed, repayment behaviors, and historical loan performance.
- Generated Time-Based Insights: Vintage features highlighted loan durations and credit history trends.
- Identified Risk Profiles: Delinquency features provided a clear picture of a client’s payment history, offering insights into their likelihood of default.
- Simplified Modeling: Consolidated data ensured models could train on a clean, enriched dataset with meaningful inputs.
Implementation of Feature Engineering¶
We created three primary categories of features: Amount Financed, Vintage (Time Duration), and Delinquency. These were aggregated from multiple tables using statistical functions such as max, min, mean, and sum. In total, we engineered 600+ custom aggregated features, enabling a comprehensive understanding of client behaviors.
1. Amount Financed¶
This category focuses on the actual amount of money clients borrowed across various loans. It includes around 120 features aggregated from different tables. Below are some examples:
POS_CASH_balance.csv: Aggregated monthly balances of POS (point of sale) and cash loans:
- TOTAL_POS_AMT: Total of all POS cash balances.
- AVG_POS_AMT: Average POS balance across months.
credit_card_balance.csv: Monthly balances of credit card loans:
- MAX_CREDIT_CARD_AMT: Maximum credit card balance.
- RATIO_CREDIT_USED: Ratio of credit used to credit limit, which helped quantify spending behavior.
installments_payments.csv: Repayment details:
- SUM_INSTALLMENT_PAID: Total repayment amount, which indicated the total financial commitment of the client.
- AVG_INSTALLMENT_PAID: Average installment payment, useful for gauging consistency in repayments.
👉 Explore the notebook for Amount Financed Features
2. Vintage (Time Duration)¶
Vintage features represent performance metrics of loans and accounts over time. They track clients’ financial stability and credit history trends. This category includes 200+ features. Examples include:
bureau.csv: Loan durations from external institutions:
- MAX_LOAN_DURATION: Maximum duration of loans, highlighting long-term financial commitments.
- AVG_LOAN_DURATION: Mean loan duration, useful for assessing average borrowing patterns.
previous_application.csv: Time gaps and durations of past Home Credit applications:
- MIN_TIME_SINCE_LAST_APPLICATION: Shortest gap between previous loans, indicating rapid borrowing behavior.
- AVG_LOAN_VINTAGE: Average duration of previous loans, which revealed financial maturity over time.
👉 Explore the notebook for Vintage Features
3. Delinquency¶
Delinquency features capture past repayment performance, highlighting the success and failures in honoring loans. This category includes 360+ features derived from tables such as:
bureau.csv: Delinquency patterns:
- COUNT_LOANS_DELINQUENT: Number of delinquent loans, a strong indicator of risk.
- MAX_DELINQUENCY_DAYS: Maximum days past due, revealing the severity of delinquencies.
bureau_balance.csv: Monthly bureau balance data:
- SUM_DELINQUENT_MONTHS: Total months of delinquency, summarizing the extent of missed payments.
- RATIO_DELINQUENT_MONTHS: Ratio of delinquent months to total credit history, normalizing delinquency across credit duration.
👉 Explore the notebook for Delinquency Features
Aggregation at SK_ID_CURR Level¶
To ensure that each client is represented by a single row, all features were aggregated at the SK_ID_CURR level. The aggregation process involved:
- Statistical Functions: max, min, mean, and sum to summarize data.
- Ratios: Relationships such as RATIO_AMT_CREDIT_TO_INCOME were derived for added insights.
- Time-Based Calculations: Differences between key dates were used to generate durations like loan vintage and repayment delays.
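To make the aggregation pattern concrete, here is a minimal sketch using a made-up two-client stand-in for installments_payments.csv. Only the SK_ID_CURR and AMT_PAYMENT column names follow the real schema; the values and the exact set of output features are illustrative.

```python
import pandas as pd

# Toy stand-in for installments_payments.csv: several repayment rows per client
installments = pd.DataFrame({
    'SK_ID_CURR': [1001, 1001, 1001, 1002, 1002],
    'AMT_PAYMENT': [5000.0, 5000.0, 4500.0, 12000.0, 11800.0],
})

# Collapse to one row per client using the statistical functions listed above
agg = installments.groupby('SK_ID_CURR')['AMT_PAYMENT'].agg(
    SUM_INSTALLMENT_PAID='sum',
    AVG_INSTALLMENT_PAID='mean',
    MAX_INSTALLMENT_PAID='max',
    MIN_INSTALLMENT_PAID='min',
).reset_index()

print(agg)  # one row per SK_ID_CURR, ready to merge onto application_train
```

In the project itself, this same groupby/agg pattern is applied across many columns and tables before merging the results back on SK_ID_CURR.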
Top Features Selection¶
Selecting the most relevant features is a crucial step in building efficient and interpretable machine learning models. By reducing the dimensionality of the dataset, we can focus on the features that have the highest impact on the predictive performance, improve computational efficiency, and reduce overfitting.
In this project, we employed three key techniques for feature selection to identify the top features from our baseline models.
1. Lasso Feature Selection¶
Lasso (Least Absolute Shrinkage and Selection Operator) regression was used to select features by applying L1 regularization. Lasso assigns weights to features and shrinks less important feature weights to zero, effectively performing feature selection.
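A minimal sketch of this idea with scikit-learn, run on synthetic data rather than the project's engineered features (the alpha value here is an arbitrary illustration, not a tuned one):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X = StandardScaler().fit_transform(X)  # L1 penalties are scale-sensitive

# L1 regularization drives uninformative coefficients exactly to zero;
# SelectFromModel keeps only the features with non-zero weights
selector = SelectFromModel(Lasso(alpha=0.01)).fit(X, y)
kept = np.flatnonzero(selector.get_support())
print(f"{len(kept)} of {X.shape[1]} features kept:", kept)
```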
2. Recursive Feature Elimination (RFE)¶
Recursive Feature Elimination iteratively trains the model and removes the least important features based on the model’s performance. By repeating this process, RFE selects the optimal set of features.
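The same process sketched with scikit-learn's RFE on synthetic data (the estimator and the target of 5 features are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=15, n_informative=4, random_state=0)

# Repeatedly fit the model and drop the weakest feature until 5 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5, step=1).fit(X, y)
print("Selected feature indices:", [i for i, keep in enumerate(rfe.support_) if keep])
```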
3. Random Forest Feature Importance¶
Random Forest, an ensemble learning method, was used to evaluate feature importance based on the decrease in impurity when features are split. This technique is robust to overfitting and can handle nonlinear relationships.
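A corresponding sketch of impurity-based importances, again on synthetic data (the number of trees is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=10, n_informative=3, random_state=0)

# Each feature's importance is its mean decrease in impurity across all trees
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top5 = np.argsort(rf.feature_importances_)[::-1][:5]
print("Top 5 features by impurity importance:", top5)
```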
Machine Learning Methodology and Architecture - Model Stacking¶
Based on the insights gained from exploratory data analysis and feature engineering, we adopted a robust and systematic methodology for prediction tasks, leveraging the power of model stacking and neural networks to handle the complexity and imbalance in the dataset.
Model Stacking in Machine Learning¶
What is Model Stacking?¶
Model stacking is an ensemble learning technique where predictions from multiple base models (called "level-0 models" or baseline models) are combined using another model, known as a meta-model (or "level-1 model"), to make the final predictions.
How it Works¶
- Train Base Models: Multiple models (e.g., decision trees, neural networks, or SVMs) are trained on the dataset.
- Generate Predictions: The base models make predictions on a validation set or unseen data.
- Train Meta-Model: The meta-model is trained using the predictions from the base models as input features, learning how to best combine these predictions.
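The three steps above can be sketched with scikit-learn's StackingClassifier on synthetic data. The base models and meta-model here are placeholders for illustration, not the project's actual ANN stack:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Level-0 base models: their out-of-fold predictions become meta-features
base = [('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('mlp', MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0))]

# Level-1 meta-model: learns how to weigh the base-model predictions
stack = StackingClassifier(estimators=base, final_estimator=LogisticRegression(), cv=3)
stack.fit(X_tr, y_tr)
print(f"Stacked test accuracy: {stack.score(X_te, y_te):.3f}")
```

Internally, `cv=3` means the base models' predictions used to train the meta-model come from cross-validation, so the meta-model never sees predictions a base model made on its own training data.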
Why Stacking is Used¶
- Leverage Model Diversity: Different models capture different patterns or features in the data.
- Reduce Overfitting: Combining predictions reduces the likelihood of overfitting compared to individual models.
- Improve Generalization: The meta-model learns from the weaknesses of the base models and makes better predictions on unseen data.
Benefits of Stacking¶
- Better Generalization: The meta-model reduces bias and variance errors by learning how to combine base model outputs.
- Flexibility: A wide variety of models (simple or complex) can be used in the stack.
- Improved Accuracy: Stacking often outperforms individual models and simpler ensemble methods like bagging or boosting.
By allowing the ensemble to learn how to effectively weigh different base models, stacking helps create better generalized models.
The figure below shows the stacking architecture¶
Amount Finance Baseline Models (afm1 and afm2)¶
We have attached the links to the individual baseline model code.
After attempting several tree-based and boosting models such as Random Forest, LightGBM, and XGBoost and getting subpar results, we decided to proceed with artificial neural networks for both models.
afm1 - AMTF_POS_CASH_MODEL¶
This model was built on the amount finance features derived from POS_CASH_Balance.csv and installments.csv, merged with the target values from application_train and aggregated at the SK_ID_CURR level.
Classification Score¶
afm2 - AMTF_CC_MODEL¶
- afm2 link - https://github.com/kautilyaa/TRUST/blob/main/Baseline_Models/SM_CC_installments_model.ipynb
This model was built on the amount finance features derived from credit_card_balance.csv and installments.csv, merged with the target values from application_train and aggregated at the SK_ID_CURR level.
Classification Report¶
Delinquency Baseline Model (bm1)¶
We have attached the links to the individual baseline model code.
We initially tried linear classifiers such as Logistic Regression and an SVM classifier without much luck. We later realised that many of the columns were highly correlated, which compounded the problem of the imbalanced target variable. LightGBM gave decent results, but the final decision was to go with an ANN, as before. We tried several different combinations of hidden layers and settled on the architecture given below, achieving a test accuracy of about 73%, which we deemed good for a weak learner.
bm1 - DEQ_BASELINE_MODEL¶
The model was built using features with 2 levels of aggregation utilizing the tables bureau.csv and bureau_balance.csv.
Classification Report:¶
Vintage Baseline Models (vm1 and vm2)¶
We have attached the links to the individual baseline model code.
After attempting several tree-based and boosting models such as Random Forest, LightGBM, and XGBoost and getting subpar results, we decided to proceed with artificial neural networks for both models.
vm1 - VIM_BUREAU_MODEL¶
- vm1 link - {modellink}
The model was built using a combination of aggregated vintage features from bureau.csv and application_train.
Classification Report:¶
vm2 - VIM_PREV_MODEL¶
- vm2 link - {modellink}
The model was built using a combination of aggregated vintage features from previous_application.csv and application_train.
Classification Report:¶
Final Dataset Preparation¶
The top features from each feature group, together with the residuals and predicted probabilities from each baseline model, are joined to application_train on SK_ID_CURR (left outer joins, as in the code below) to create the final dataset.
Pyspark code link - https://github.com/kautilyaa/TRUST/blob/main/Feature_Selection/Meta%20Model%20Dataset%20Prep.ipynb
appl_train = spark.table('hive_metastore.default.application_train')
# Vintage features
vin_appl_train = spark.table('hive_metastore.default.vin_finalized_application_train')
vin_prev = spark.table('hive_metastore.default.ay_vin_prev_pred')
vin_bur = spark.table('hive_metastore.default.ay_vin_bur_pred')
# Amount finance features
amtf_pos_cash = spark.table('hive_metastore.default.amtf_pos_cash_final_prediction_data2')
amtf_cc = spark.table('hive_metastore.default.amtf_cc_final_pred_df')
# Delinquency features
deq_df = spark.table('hive_metastore.default.bu_final_prediction_data2')
# Drop the TARGET column from vin_appl_train to avoid duplicating it after the join
vin_appl_train = vin_appl_train.drop("TARGET")
vin_appl_train.limit(2).display()
# Rename the credit-card probability column to avoid a clash with the POS cash one
amtf_cc = amtf_cc.withColumnRenamed("Prediction_Probability_pos_cash", "Prediction_Prob_cc")
# Perform left outer joins on SK_ID_CURR
result_df = appl_train.join(vin_appl_train, on="SK_ID_CURR", how="left") \
    .join(vin_prev, on="SK_ID_CURR", how="left") \
    .join(vin_bur, on="SK_ID_CURR", how="left") \
    .join(amtf_pos_cash, on="SK_ID_CURR", how="left") \
    .join(amtf_cc, on="SK_ID_CURR", how="left") \
    .join(deq_df, on="SK_ID_CURR", how="left")
# Display the first rows of the final result
result_df.limit(2).display()
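For illustration, the same left-join pattern on toy pandas frames (hypothetical values) shows why left joins are used here: every applicant in application_train is kept, with NaN where a baseline model produced no prediction:

```python
import pandas as pd

# Toy stand-ins for the application table and one baseline-model output
appl_train = pd.DataFrame({'SK_ID_CURR': [1, 2, 3], 'TARGET': [0, 1, 0]})
vin_prev = pd.DataFrame({'SK_ID_CURR': [1, 3], 'pred_proba_prev': [0.1, 0.4]})

# Left join keeps all three applicants; SK_ID_CURR 2 gets NaN for the prediction
result_df = appl_train.merge(vin_prev, on='SK_ID_CURR', how='left')
```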
Final Model Curation (Meta Model)¶
final_data_df = pd.read_csv('/content/drive/MyDrive/UMD/DATA602_DATA_Science/Project/Data Segregator /Code/FinalData/ultimate_op_dataset.csv')
missing_threshold = 0.5
missing_data = final_data_df.isnull().mean()
columns_to_drop = missing_data[missing_data > missing_threshold].index
final_data_df = final_data_df.drop(columns=columns_to_drop)
len(columns_to_drop)
55
final_data_df.shape
(307511, 123)
Impute missing values: sentinel values for numeric columns, "UNK" for categorical¶
Replaces NaN values:
- For numerical columns: based on the column's minimum value.
- If the minimum value is 0, replace NaN with -1.
- If the minimum value is negative, replace NaN with a sentinel one order of magnitude more negative (the nearest negative power of 10, times 10, plus 1).
- If the minimum value is positive (unexpected), fall back to -10.
- For categorical columns: replace NaN with "UNK".
Parameters:
data (pd.DataFrame): Input DataFrame.
Returns:
pd.DataFrame: DataFrame with NaN values replaced.
def handle_nan(data):
    for column in data.columns:
        if data[column].dtype in ['float64', 'int64']:
            # Numerical column: pick a sentinel value below the observed minimum
            min_val = data[column].min(skipna=True)
            if min_val == 0:
                replacement_value = -1
            elif min_val < 0:
                nearest_negative_power = -10 ** np.ceil(np.log10(abs(min_val)))
                replacement_value = nearest_negative_power * 10 + 1
            else:
                # If min_val > 0 (unexpected), fall back to a default sentinel
                replacement_value = -10
            # Replace NaN with the calculated sentinel value
            data[column] = data[column].fillna(replacement_value)
        elif data[column].dtype == 'object':
            # Categorical column: replace NaN with "UNK"
            data[column] = data[column].fillna("UNK")
    return data
# Apply the function to your dataset
final_data_df = handle_nan(final_data_df)
# Validate the changes
print("Missing Values After Handling:")
print(final_data_df.isnull().sum())
Missing Values After Handling:
SK_ID_CURR 0
TARGET 0
NAME_CONTRACT_TYPE 0
CODE_GENDER 0
FLAG_OWN_CAR 0
..
b_credit_AMT_CREDIT_DEBT_DIFF_mean 0
b_credit_DAYS_CREDIT_mean 0
b_active_DAYS_CREDIT_mean 0
DEQ_AVG_COUNT_DPD0P_36MOB_ALL 0
DEQ_AVG_COUNT_DPD0P_3MOB_ALL 0
Length: 123, dtype: int64
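As a self-contained sanity check of the sentinel logic, a condensed copy of `handle_nan` can be run on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Condensed copy of handle_nan above, reproduced so this snippet runs on its own
def handle_nan(data):
    for column in data.columns:
        if data[column].dtype in ['float64', 'int64']:
            min_val = data[column].min(skipna=True)
            if min_val == 0:
                replacement_value = -1
            elif min_val < 0:
                replacement_value = -10 ** np.ceil(np.log10(abs(min_val))) * 10 + 1
            else:
                replacement_value = -10
            data[column] = data[column].fillna(replacement_value)
        elif data[column].dtype == 'object':
            data[column] = data[column].fillna("UNK")
    return data

toy = pd.DataFrame({
    'zero_min': [0.0, 2.0, np.nan],   # min 0  -> NaN becomes -1
    'neg_min':  [-5.0, 1.0, np.nan],  # min -5 -> NaN becomes -99
    'cat':      ['x', None, 'y'],     # object -> NaN becomes "UNK"
})
filled = handle_nan(toy)
```

The sentinels deliberately sit outside each column's observed range, so the imputed values remain distinguishable from real observations.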
final_data_df.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | b_DAYS_CREDIT_mean | b_720_DAYS_CREDIT_PLAN_sum | b_consumer_DAYS_CREDIT_ENDDATE_mean | b_credit_AMT_CREDIT_SUM_DEBT_sum | b_CNT_CREDIT_PROLONG_sum | b_credit_AMT_CREDIT_DEBT_DIFF_mean | b_credit_DAYS_CREDIT_mean | b_active_DAYS_CREDIT_mean | DEQ_AVG_COUNT_DPD0P_36MOB_ALL | DEQ_AVG_COUNT_DPD0P_3MOB_ALL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100011 | 0 | Cash loans | F | N | Y | 0 | 112500.0 | 1019610.0 | 33826.5 | ... | -1773.000000 | -999999.0 | -1437.666667 | 0.0 | 0.0 | -54000.0 | -1960.00 | -99999.00 | 0.0 | 0.0 |
| 1 | 100023 | 0 | Cash loans | F | N | Y | 1 | 90000.0 | 544491.0 | 17563.5 | ... | -1164.384615 | 3652.0 | -264.333333 | 41881.5 | 0.0 | -47236.5 | -1275.25 | -1141.25 | 0.0 | 0.0 |
| 2 | 100036 | 0 | Cash loans | F | N | Y | 0 | 112500.0 | 512064.0 | 25033.5 | ... | -889.000000 | -999999.0 | -462.000000 | -99999999.0 | 0.0 | -999999999.0 | -99999.00 | -889.00 | 0.0 | 0.0 |
| 3 | 100043 | 0 | Cash loans | F | N | Y | 2 | 198000.0 | 641173.5 | 23157.0 | ... | -1904.000000 | -999999.0 | -2453.000000 | 0.0 | 0.0 | -306000.0 | -1169.00 | -99999.00 | 0.0 | 0.0 |
| 4 | 100047 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 1193580.0 | 35028.0 | ... | -1104.800000 | 2557.0 | 701.666667 | 692253.0 | 0.0 | 62253.0 | -956.00 | -857.00 | 0.0 | 0.0 |
5 rows × 123 columns
# 2. Handle Outliers
# Cap outliers at the 5th and 95th percentiles for numeric columns
numeric_cols = final_data_df.select_dtypes(include=['float64', 'int64']).columns
for col in numeric_cols:
lower_bound = final_data_df[col].quantile(0.05)
upper_bound = final_data_df[col].quantile(0.95)
final_data_df[col] = np.clip(final_data_df[col], lower_bound, upper_bound)
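The same capping can also be done without `np.clip` via pandas' `Series.clip`; a minimal sketch on a toy column standing in for one of the numeric features:

```python
import numpy as np
import pandas as pd

# Toy column: values 0..99
s = pd.Series(np.arange(100, dtype=float))

lower = s.quantile(0.05)
upper = s.quantile(0.95)
capped = s.clip(lower, upper)  # equivalent to np.clip(s, lower, upper)
```

Values below the 5th percentile are pulled up to it and values above the 95th are pulled down, while everything in between is untouched.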
final_data_df.describe(include='all')
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | b_DAYS_CREDIT_mean | b_720_DAYS_CREDIT_PLAN_sum | b_consumer_DAYS_CREDIT_ENDDATE_mean | b_credit_AMT_CREDIT_SUM_DEBT_sum | b_CNT_CREDIT_PROLONG_sum | b_credit_AMT_CREDIT_DEBT_DIFF_mean | b_credit_DAYS_CREDIT_mean | b_active_DAYS_CREDIT_mean | DEQ_AVG_COUNT_DPD0P_36MOB_ALL | DEQ_AVG_COUNT_DPD0P_3MOB_ALL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511 | 307511 | 307511 | 307511 | 307511.000000 | 307511.000000 | 3.075110e+05 | 307511.000000 | ... | 307511.000000 | 307511.000000 | 307511.000000 | 3.075110e+05 | 307511.000000 | 3.075110e+05 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 |
| unique | NaN | NaN | 2 | 3 | 2 | 2 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | NaN | Cash loans | F | N | Y | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | 278232 | 202448 | 202924 | 213312 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 278183.072462 | 0.080729 | NaN | NaN | NaN | NaN | 0.400509 | 163167.915380 | 5.846019e+05 | 26598.793372 | ... | -15247.987454 | -336709.799864 | -200052.930561 | -4.755879e+07 | -0.143149 | -4.762598e+08 | -44591.207384 | -29846.247700 | -0.100255 | -0.133198 |
| std | 101331.601166 | 0.272419 | NaN | NaN | NaN | NaN | 0.664724 | 72517.568752 | 3.563103e+05 | 12481.993515 | ... | 34644.575222 | 476031.173670 | 399518.102108 | 5.000294e+07 | 0.350226 | 4.993877e+08 | 49169.245572 | 45255.788575 | 0.393721 | 0.356365 |
| min | 117945.500000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 67500.000000 | 1.350000e+05 | 9000.000000 | ... | -99999.000000 | -999999.000000 | -999999.000000 | -1.000000e+08 | -1.000000 | -1.000000e+09 | -99999.000000 | -99999.000000 | -1.000000 | -1.000000 |
| 25% | 189145.500000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 112500.000000 | 2.700000e+05 | 16524.000000 | ... | -1719.477273 | -999999.000000 | -1687.000000 | -1.000000e+08 | 0.000000 | -1.000000e+09 | -99999.000000 | -99999.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 147150.000000 | 5.135310e+05 | 24903.000000 | ... | -1171.250000 | 730.000000 | -720.333333 | 0.000000e+00 | 0.000000 | -3.331350e+05 | -1791.750000 | -801.666667 | 0.000000 | 0.000000 |
| 75% | 367142.500000 | 0.000000 | NaN | NaN | NaN | NaN | 1.000000 | 202500.000000 | 8.086500e+05 | 34596.000000 | ... | -734.633333 | 2424.000000 | -136.571429 | 6.300000e+04 | 0.000000 | -4.912992e+04 | -757.000000 | -390.000000 | 0.000000 | 0.000000 |
| max | 438427.500000 | 1.000000 | NaN | NaN | NaN | NaN | 2.000000 | 337500.000000 | 1.350000e+06 | 53325.000000 | ... | -262.000000 | 29273.000000 | 674.000000 | 4.267385e+05 | 0.000000 | 0.000000e+00 | -227.000000 | -120.000000 | 0.583333 | 0.166667 |
11 rows × 123 columns
binary_map = {'Y': 1, 'N': 0}
binary_cols = ['FLAG_OWN_CAR', 'FLAG_OWN_REALTY']
for col in binary_cols:
if col in final_data_df.columns:
final_data_df[col] = final_data_df[col].map(binary_map)
# Convert other categorical variables into one-hot encoding
categorical_cols = final_data_df.select_dtypes(include=['object']).columns
final_data_df = pd.get_dummies(final_data_df, columns=categorical_cols, drop_first=True)
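A tiny illustration of the `drop_first` behaviour: a column with k categories becomes k-1 indicator columns, the alphabetically first category serving as the implicit baseline:

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red']})
# 'blue' (first alphabetically) is dropped, leaving only the 'color_red' indicator
encoded = pd.get_dummies(df, columns=['color'], drop_first=True)
```

Dropping the first level avoids perfect collinearity among the dummies, which matters for the logistic-regression meta-model later on.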
# Get the residual and probability columns and add them to the exclude list
prob_data_col_name = [col for col in final_data_df.columns if 'prob' in col or 'Probability' in col]
res_data_col_name = [col for col in final_data_df.columns if 'residual' in col]
#get numerical features
numeric_cols = final_data_df.select_dtypes(include=['float64', 'int64']).columns
numeric_cols = numeric_cols.difference(['TARGET', 'SK_ID_CURR']) # Exclude these columns
numeric_cols = numeric_cols.difference(prob_data_col_name)
numeric_cols = numeric_cols.difference(res_data_col_name)
numeric_cols
Index(['AMT_ANNUITY', 'AMT_CREDIT', 'AMT_GOODS_PRICE', 'AMT_INCOME_TOTAL',
'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_YEAR',
...
'vin_avg_time_between_loans_min',
'vin_days_since_credit_update_min_max',
'vin_days_since_first_loan_taken_max_sum',
'vin_days_since_last_document_update_max',
'vin_last_loan_cash_loans_loan_sum',
'vin_last_loan_consumer_loans_loan_sum',
'vin_months_since_last_approved_max_sum', 'vin_num_active_loans_sum',
'vin_total_loan_amount_consumer_loans_min_sum',
'vin_total_loan_amount_xna_std_sum'],
dtype='object', length=102)
scaler = StandardScaler()
scaled_numeric_data = pd.DataFrame(
scaler.fit_transform(final_data_df[numeric_cols]),
columns=numeric_cols,
index=final_data_df.index
)
# Drop the original non-scaled numeric columns and concatenate the scaled ones
remainder_data = final_data_df.drop(columns=numeric_cols)
final_data_df = pd.concat([scaled_numeric_data, remainder_data], axis=1)
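`StandardScaler`'s effect can be sanity-checked on a toy column: after `fit_transform`, each column has zero mean and unit variance:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})
scaled = StandardScaler().fit_transform(df)
# Each column is now centred (mean 0) with unit variance
```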
final_data_df.describe(include='all')
| AMT_ANNUITY | AMT_CREDIT | AMT_GOODS_PRICE | AMT_INCOME_TOTAL | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_YEAR | ... | ORGANIZATION_TYPE_Trade: type 6 | ORGANIZATION_TYPE_Trade: type 7 | ORGANIZATION_TYPE_Transport: type 1 | ORGANIZATION_TYPE_Transport: type 2 | ORGANIZATION_TYPE_Transport: type 3 | ORGANIZATION_TYPE_Transport: type 4 | ORGANIZATION_TYPE_University | ORGANIZATION_TYPE_XNA | EMERGENCYSTATE_MODE_UNK | EMERGENCYSTATE_MODE_Yes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3.075110e+05 | 3.075110e+05 | 3.075110e+05 | 3.075110e+05 | 3.075110e+05 | 3.075110e+05 | 3.075110e+05 | 3.075110e+05 | 3.075110e+05 | 3.075110e+05 | ... | 307511 | 307511 | 307511 | 307511 | 307511 | 307511 | 307511 | 307511 | 307511 | 307511 |
| unique | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| top | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | False | False | False | False | False | False | False | False | False | False |
| freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 306880 | 299680 | 307310 | 305307 | 306324 | 302113 | 306184 | 252137 | 161756 | 305183 |
| mean | -5.402242e-17 | 4.477992e-17 | 3.068511e-17 | -2.195094e-16 | -1.878076e-16 | -1.878076e-16 | -3.660031e-17 | -6.210961e-17 | -1.878076e-16 | 3.844881e-17 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| std | 1.000002e+00 | 1.000002e+00 | 1.000002e+00 | 1.000002e+00 | 1.000002e+00 | 1.000002e+00 | 1.000002e+00 | 1.000002e+00 | 1.000002e+00 | 1.000002e+00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| min | -1.409937e+00 | -1.261829e+00 | -1.188593e+00 | -1.319240e+00 | -2.531109e+00 | -2.531109e+00 | -1.912958e+00 | -1.605835e+00 | -2.531109e+00 | -1.360678e+00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | -8.071475e-01 | -8.829451e-01 | -8.754227e-01 | -6.986996e-01 | 3.950837e-01 | 3.950837e-01 | -1.383378e-02 | -1.241621e-01 | 3.950837e-01 | -7.988704e-01 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | -1.358594e-01 | -1.994638e-01 | -2.354655e-01 | -2.208836e-01 | 3.950837e-01 | 3.950837e-01 | -1.383378e-02 | -1.241621e-01 | 3.950837e-01 | -2.370632e-01 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | 6.407005e-01 | 6.288016e-01 | 4.589562e-01 | 5.423810e-01 | 3.950837e-01 | 3.950837e-01 | -1.383378e-02 | -1.241621e-01 | 3.950837e-01 | 8.865512e-01 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| max | 2.141184e+00 | 2.148126e+00 | 2.351596e+00 | 2.404002e+00 | 3.950837e-01 | 3.950837e-01 | 1.885290e+00 | 2.839183e+00 | 3.950837e-01 | 2.010165e+00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
11 rows × 226 columns
# Compute the correlation matrix
correlation_matrix = final_data_df.corr()
# Identify highly correlated variables (correlation > 0.9)
high_corr_vars = correlation_matrix[(correlation_matrix > 0.9) & (correlation_matrix != 1.0)].stack().index
# Extract the second column (potentially redundant features)
redundant_features = set([col[1] for col in high_corr_vars])
# Define columns to protect (like "residual" and "prob")
# protected_features = [col for col in final_data_df.columns if 'residual' in col or 'prob' in col]
protected_features = [col for col in final_data_df.columns if 'residual' in col]
# Remove redundant features that are not protected, plus any probability columns
features_to_drop = [col for col in redundant_features if col not in protected_features or 'prob' in col]
# Drop the redundant features from the dataset
final_data_df_select = final_data_df.drop(columns=features_to_drop, errors='ignore')
# Validate the shape of the new dataset
print(f"Original Dataset Shape: {final_data_df.shape}")
print(f"Reduced Dataset Shape: {final_data_df_select.shape}")
# Plot the correlation matrix
plt.figure(figsize=(12, 10))
plt.title("Correlation Matrix of Selected Features", fontsize=16)
plt.imshow(correlation_matrix, cmap='mako', interpolation='nearest')
plt.colorbar()
plt.show()
Original Dataset Shape: (307511, 226)
Reduced Dataset Shape: (307511, 190)
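One caveat of stacking the full (symmetric) correlation matrix is that both members of a correlated pair can be flagged for removal. A common variant scans only the upper triangle so that one feature of each pair survives; a toy sketch (not the project's code):

```python
import numpy as np
import pandas as pd

# Toy frame: 'b' and 'c' are perfectly correlated with 'a'
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 3, 2, 1]})
corr = df.corr().abs()

# Keep only the strict upper triangle so each pair is examined once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
```

Here `'a'` is retained while its redundant partners `'b'` and `'c'` are dropped.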
8.6. Model Training¶
# Splitting the data
X = final_data_df_select.drop(columns=['TARGET', 'SK_ID_CURR']) # Features
y = final_data_df_select['TARGET'] # Target
y = y.astype(int)
# Standard 80/20 split; we tried other split ratios with similar results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Logistic Regression
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
y_proba = logreg.predict_proba(X_test)[:, 1]
# Evaluation
print(classification_report(y_test, y_pred))
print("AUC-ROC:", roc_auc_score(y_test, y_proba))
precision recall f1-score support
0 0.98 1.00 0.99 56539
1 1.00 0.77 0.87 4964
accuracy 0.98 61503
macro avg 0.99 0.88 0.93 61503
weighted avg 0.98 0.98 0.98 61503
AUC-ROC: 0.9631073900640831
Model Performance Analysis¶
Overall Performance¶
- Accuracy: The model achieves an impressive accuracy of 0.98, meaning it correctly classifies 98% of the samples. However, with an imbalanced dataset (as mentioned earlier), accuracy can be misleading.
- AUC-ROC: The Area Under the Receiver Operating Characteristic Curve (AUC-ROC) is 0.96, indicating excellent discriminatory power. This means the model is very good at distinguishing between the two classes (0 and 1).
Class-wise Performance¶
Class 0 (Loan Repaid)¶
- Precision: 0.98 - For samples predicted as class 0, the model is correct 98% of the time.
- Recall: 1.00 - The model correctly identifies almost all actual class 0 samples.
- F1-Score: 0.99 - A high F1-score reflects a good balance between precision and recall for class 0.
Class 1 (Loan Defaulted)¶
- Precision: 1.00 - For samples predicted as class 1, the model is correct 100% of the time. This indicates a very low rate of false positives for this class.
- Recall: 0.77 - The model correctly identifies 77% of the actual class 1 samples. This might be an area for improvement.
- F1-Score: 0.87 - While lower than the F1-score for class 0, it is still relatively high, suggesting a decent balance between precision and recall for class 1.
Support¶
- Class 0 (Loan Repaid): 56,539 samples.
- Class 1 (Loan Defaulted): 4,964 samples.
- The dataset imbalance is evident, with significantly more samples in class 0 than in class 1.
Macro and Weighted Averages¶
Macro Average:
- Calculates the average of precision, recall, and F1-score across both classes, treating them equally.
Weighted Average:
- Calculates the weighted average of precision, recall, and F1-score, considering the number of samples in each class. It gives more weight to the majority class (class 0 in this case).
Key Insights and Potential Improvements¶
High Performance but Imbalanced:
- The model demonstrates strong overall performance, particularly in identifying loan repayments (class 0).
- However, the recall for loan defaults (class 1) is relatively lower, suggesting some defaults might be missed.
Focus on Recall for Class 1:
- Given the importance of identifying potential loan defaults, consider strategies to improve the recall for class 1.
- This could involve:
- Adjusting the model's classification threshold.
- Using different algorithms.
- Employing techniques to handle class imbalance, such as oversampling, undersampling, or cost-sensitive learning.
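The first of these, threshold adjustment, is cheap to try: lowering the decision threshold flags more applicants as defaults, so recall for class 1 can only stay equal or rise (at some cost in precision). A sketch on synthetic imbalanced data, not the project's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 90/10 imbalanced data standing in for the loan-default problem
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

r_default = recall_score(y_te, proba >= 0.5)  # standard threshold
r_low = recall_score(y_te, proba >= 0.3)      # lowered threshold: recall can only rise
```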
Trade-off Between Precision and Recall:
- Explore the trade-off between precision and recall, especially for class 1.
- A lower precision might be acceptable if it leads to a significant increase in recall, as identifying more potential defaults is crucial.
Further Analysis:
- Analyze misclassified samples to understand the characteristics of cases where the model makes errors.
- This can provide insights into potential biases or areas for improvement.
Conclusion¶
- The model shows promising results with high accuracy and AUC-ROC.
- However, given the imbalanced dataset and the importance of identifying loan defaults, focusing on improving the recall for class 1 is recommended.
- This can be achieved through the techniques mentioned above.
- Further analysis of misclassified samples can provide valuable insights for model refinement.
Visualizations¶
pred_proba = logreg.predict_proba(X_test)[:, 1]
pred_proba
array([7.24117649e-03, 4.52342533e-02, 8.10559343e-03, ...,
4.99365034e-02, 8.86918137e-13, 2.61316840e-02])
# Create the scatter plot
plt.figure(figsize=(10, 6))
scatter = plt.scatter(
np.arange(len(pred_proba)), # Use the index as x-axis
pred_proba, # Use predicted probabilities as y-axis
c=y_test, # Color points by the actual values
cmap='coolwarm', # Color map (red/blue for binary classification)
alpha=0.7, # Transparency
# edgecolor='k' # Add a black border to points for better visibility
)
# Add colorbar for clarity
plt.colorbar(scatter, label='Actual Values (y_test)')
# Plot settings
plt.title('Predicted Probabilities Colored by Actual Values', fontsize=16)
plt.xlabel('Index', fontsize=14)
plt.ylabel('Predicted Probabilities', fontsize=14)
plt.grid(alpha=0.3)
plt.show()
# Sort the predicted probabilities and corresponding y_test
sorted_indices = np.argsort(pred_proba)
# # Sort the predicted probabilities and corresponding y_test
sorted_pred_proba = pred_proba[sorted_indices]
sorted_y_test = y_test.iloc[sorted_indices]
# Create the scatter plot
plt.figure(figsize=(10, 6))
scatter = plt.scatter(
np.arange(len(sorted_pred_proba)), # Use the index of sorted predictions as x-axis
sorted_pred_proba, # Use sorted predicted probabilities as y-axis
c=sorted_y_test, # Color points based on the sorted actual values
cmap='coolwarm', # Color map (red/blue for binary classification)
alpha=0.7, # Transparency
# edgecolor='k' # Add a black border to points for better visibility
)
# Add colorbar for clarity
plt.colorbar(scatter, label='Actual Values (y_test)')
# Plot settings
plt.title('Sorted Predicted Probabilities Colored by Actual Values', fontsize=16)
plt.xlabel('Sorted Index', fontsize=14)
plt.ylabel('Predicted Probabilities', fontsize=14)
plt.grid(alpha=0.3)
plt.show()
pred_proba = logreg.predict_proba(X_train)[:, 1]
quantiles = [0.10, 0.25, 0.50, 0.625, 0.85, 0.90]
# Calculate the quantiles using numpy.quantile()
quantile_values = np.quantile(pred_proba, quantiles)
# Print the results
for quantile, value in zip(quantiles, quantile_values):
print(f"{quantile*100:.0f}th percentile: {value:.4f}")
10th percentile: 0.0020
25th percentile: 0.0095
50th percentile: 0.0259
62th percentile: 0.0368
85th percentile: 0.0842
90th percentile: 0.1342
y_pred_proba = logreg.predict_proba(X_test)[:, 1]
# Use the training-set quantiles as risk-band thresholds
adjusted_thresholds = quantile_values  # 6 thresholds; digitize output is clipped to 6 bands below
# Assign risk levels (R1 to R6) based on the adjusted thresholds
risk_labels = ['R1', 'R2', 'R3', 'R4', 'R5', 'R6']
risk_levels = np.digitize(y_pred_proba, adjusted_thresholds, right=True)
risk_levels = np.clip(risk_levels, 0, len(risk_labels) - 1) # Ensure valid index range
# Map risk levels to labels
# Flatten risk_levels to a 1D array before using as index
assigned_risks = [risk_labels[level] for level in risk_levels.flatten()]
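The digitize-and-clip mapping can be spot-checked on a few hand-picked probabilities, using the quantile thresholds printed above:

```python
import numpy as np

# Quantile thresholds from the training predictions above
thresholds = np.array([0.0020, 0.0095, 0.0259, 0.0368, 0.0842, 0.1342])
risk_labels = ['R1', 'R2', 'R3', 'R4', 'R5', 'R6']

probs = np.array([0.001, 0.03, 0.5])
# right=True: a probability equal to a threshold falls in the lower band
levels = np.clip(np.digitize(probs, thresholds, right=True), 0, len(risk_labels) - 1)
bands = [risk_labels[i] for i in levels]
# Tiny probabilities land in R1; large ones saturate at R6
```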
# Plot the probabilities with their assigned risk levels based on the quantile thresholds
plt.figure(figsize=(10, 6))
# Scatter plot for probabilities and their risk levels
plt.scatter(y_pred_proba, risk_levels + 1, color='blue', label='Predicted Probabilities', s=100)
# Label each point with its risk level
# for prob, level in zip(y_pred_proba, assigned_risks):
# plt.text(prob, risk_labels.index(level) + 1.1, f"{level}", fontsize=10, ha='center', color='red')
# Add threshold lines
for threshold, label in zip(adjusted_thresholds, risk_labels):
plt.axvline(threshold, color='gray', linestyle='--', alpha=0.7)
plt.text(threshold, 6.5, label, rotation=90, verticalalignment='center', color='green')
# Configure plot
plt.title("Predicted Probabilities and Risk Levels (Quantile Thresholds)", fontsize=14)
plt.xlabel("Predicted Probability", fontsize=12)
plt.ylabel("Risk Level (R1 to R6)", fontsize=12)
plt.yticks(range(1, 7), risk_labels)
plt.grid(True, linestyle='--', alpha=0.6)
plt.legend(["Risk Thresholds", "Predicted Probabilities"])
plt.show()
# Calculate the count of each risk level
risk_level_counts = pd.Series(assigned_risks).value_counts()
# Bar plot for the counts of each risk level
plt.figure(figsize=(8, 5))
risk_level_counts.sort_index().plot(kind="bar", color="orange", edgecolor="black")
# Customize the plot
plt.title("Counts of Each Risk Level", fontsize=14)
plt.xlabel("Risk Level", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks(rotation=0)
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.show()
# Pick one random data point from the test set
random_index = np.random.randint(0, len(X_test))
random_index
8637
random_data_point = X_test.iloc[random_index]
random_data_point
| 224919 | |
|---|---|
| AMT_ANNUITY | -1.409937 |
| AMT_INCOME_TOTAL | -1.31924 |
| AMT_REQ_CREDIT_BUREAU_MON | -0.013834 |
| AMT_REQ_CREDIT_BUREAU_QRT | -0.124162 |
| AMT_REQ_CREDIT_BUREAU_YEAR | -0.237063 |
| ... | ... |
| ORGANIZATION_TYPE_Transport: type 3 | False |
| ORGANIZATION_TYPE_Transport: type 4 | False |
| ORGANIZATION_TYPE_University | False |
| EMERGENCYSTATE_MODE_UNK | False |
| EMERGENCYSTATE_MODE_Yes | False |
188 rows × 1 columns
residual_data_col_name = [col for col in final_data_df.columns if 'residual' in col]
column_values = random_data_point[residual_data_col_name]
column_values
| 224919 | |
|---|---|
| residual_prev | 0.498647 |
| residual_bur | 0.461597 |
| residual_posh_cash | 0.437515 |
| residual | 0.486345 |
# pass it to logreg
pred_val = logreg.predict_proba(random_data_point.values.reshape(1, -1))
print(pred_val)
# Get the index of the larger class probability (0 or 1) and use it as the band index
predicted_risk_level_index = np.argmax(pred_val)
# Access risk_labels using the integer index
print(risk_labels[predicted_risk_level_index])
# Assign the risk band label
[[0.23959642 0.76040358]]
R2
To pass Real-World Data¶
data_point_main_df = final_data_df.loc[X_test.index[random_index]]
data_point_main_df
| 224919 | |
|---|---|
| AMT_ANNUITY | -1.409937 |
| AMT_CREDIT | -1.135534 |
| AMT_GOODS_PRICE | -1.052432 |
| AMT_INCOME_TOTAL | -1.31924 |
| AMT_REQ_CREDIT_BUREAU_DAY | 0.395084 |
| ... | ... |
| ORGANIZATION_TYPE_Transport: type 4 | False |
| ORGANIZATION_TYPE_University | False |
| ORGANIZATION_TYPE_XNA | False |
| EMERGENCYSTATE_MODE_UNK | False |
| EMERGENCYSTATE_MODE_Yes | False |
226 rows × 1 columns
# Get all probability feature columns
prob_data_col_name = [col for col in final_data_df.columns if 'prob' in col or 'Probability' in col]
data_point_main_df_val = data_point_main_df[prob_data_col_name]
data_point_main_df_val
| 224919 | |
|---|---|
| pred_proba_prev | 0.501353 |
| pred_proba_bur | 0.538403 |
| Prediction_Probability_pos_cash | 0.562485 |
| Prediction_Probability | 0.510933 |
# Get the residual column names
residual_data_col_name = [col for col in final_data_df.columns if 'residual' in col]
# Map each probability column to its corresponding residual column
prob_residual_dict = {}
for i in range(len(prob_data_col_name)):
    prob_residual_dict[prob_data_col_name[i]] = residual_data_col_name[i]
prob_residual_dict
{'pred_proba_prev': 'residual_prev',
'pred_proba_bur': 'residual_bur',
'Prediction_Probability_pos_cash': 'residual_posh_cash',
'Prediction_Probability': 'residual'}
def func_normalize(val):
    # Map a baseline probability to a symmetric confidence score:
    # 0.5 -> 0 (uninformative), 0 or 1 -> 1 (fully confident)
    if val < 0.5:
        val = 1 - val
    return 2 * val - 1
data_point_main_df_val = data_point_main_df[prob_data_col_name]
# Using the mapping, overwrite each residual in random_data_point with its normalized probability
for key, value in prob_residual_dict.items():
random_data_point[value] = func_normalize(data_point_main_df_val[key])
# pass it to logreg
pred_val = logreg.predict_proba(random_data_point.values.reshape(1, -1))
print(pred_val)
# Get the index of the larger class probability (0 or 1) and use it as the band index
predicted_risk_level_index = np.argmax(pred_val)
# Access risk_labels using the integer index
print(risk_labels[predicted_risk_level_index])
[[0.45976428 0.54023572]]
R2
Conclusion¶
This project aimed to predict loan defaults using the Home Credit Default Risk dataset, addressing challenges such as high class imbalance and data complexity. Through a structured approach of feature engineering, model stacking, and targeted analysis, we achieved robust results with meaningful insights.
Key Steps and Insights¶
Feature Engineering:
- We created three logical feature groups:
- Amount Financed: Derived from POS cash, credit card balances, and installment payments.
- Delinquency: Captured loan repayment behavior using bureau and bureau balance tables.
- Vintage: Highlighted historical trends and loan consistency from previous applications.
- These groups enabled us to capture applicant-specific financial behavior effectively.
Example: Features like SUM DELINQUENT MONTHS and RATIO CREDIT USED provided insights into past loan performance and credit utilization.
Baseline Models and Stacking:
- Separate baseline models were trained on each feature group to avoid noise and collinearity.
- Outputs from the baseline models were stacked together with demographic features from the main application table.
- A logistic regression meta-model was trained on the stacked outputs, improving prediction accuracy.
Visualization:
- The approach diagram clearly shows how raw data was processed, features aggregated, and models stacked for final predictions.
Approach Diagram
Model Performance:
- Our stacked approach effectively handled class imbalance, ensuring that both classes (default and non-default) were predicted accurately.
Performance Metrics:
- Precision: 0.98 for class 0, 1.00 for class 1.
- Recall: 1.00 for class 0, 0.77 for class 1.
- AUC-ROC Score: 0.9631, demonstrating strong predictive capability.
Visuals:
- The confusion matrix and classification report highlight our model's ability to balance performance across both classes.
- Predicted probabilities show a clear separation between default (red) and non-default (blue) cases.
Confusion Matrix¶
Classification Report¶
Predicted vs Actual Values¶
Sorted Predicted Probabilities¶
- Insights into Risk Bands:
- Final outputs categorized users into risk bands (R1 to R6) for easier interpretation.
- Distribution of Risk Levels: Bands like R3 and R5 had higher loan defaults, providing actionable insights for risk management.
Risk_bands_distribution¶
Final Thoughts¶
This project highlights how thoughtful feature engineering, model stacking, and domain knowledge can uncover insights even in highly imbalanced datasets. By breaking down data into meaningful groups, training baseline models, and combining them into a meta-model, we achieved high accuracy and reliable predictions.
These results can help financial institutions make better lending decisions, reduce risk, and ensure responsible credit assessment.
For a detailed look at the code, methodology, and full analysis, please refer to our GitHub Repository.